Skip to main content

TranSTL: Spatial-Temporal Localization Transformer for Multi-Label Video Classification

Hongjun Wu, Mengzhu Li, Hongzhe Liu, Cheng Xu, Xuewei Li, Yongcheng Liu

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:07:18
09 May 2022

Multi-label video classification (MLVC) is a long-standing and challenging research problem in video signal analysis. Generally, there exist many complex action labels in real-world videos and these actions are with inherent dependencies at both spatial and temporal domains. Motivated by this observation, we propose TranSTL, a spatial-temporal localization Transformer framework for MLVC tasks. In addition to leverage global action label co-occurrence, we also propose a novel plug-and-play Spatial Temporal Label Dependency(STLD) layer in TranSTL. STLD not only dynamically models the label co-occurrence in a video by self-attention mechanism but also fully captures spatial-temporal action dependencies using cross-attention strategy. As a result, our TranSTL is able to explicitly and accurately grasp the diverse action labels at both spatial and temporal domains. Extensive evaluation and empirical analysis show that TranSTL achieves superior performance over the state of the arts on two challenging benchmarks, Charades and Multi-Thumos.

More Like This

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00