TranSTL: Spatial-Temporal Localization Transformer for Multi-Label Video Classification
Hongjun Wu, Mengzhu Li, Hongzhe Liu, Cheng Xu, Xuewei Li, Yongcheng Liu
SPS
Length: 00:07:18
Multi-label video classification (MLVC) is a long-standing and challenging research problem in video signal analysis. Real-world videos generally contain many complex action labels, and these actions have inherent dependencies in both the spatial and temporal domains. Motivated by this observation, we propose TranSTL, a spatial-temporal localization Transformer framework for MLVC tasks. In addition to leveraging global action label co-occurrence, we propose a novel plug-and-play Spatial Temporal Label Dependency (STLD) layer in TranSTL. STLD not only dynamically models the label co-occurrence in a video through a self-attention mechanism but also captures spatial-temporal action dependencies using a cross-attention strategy. As a result, TranSTL is able to explicitly and accurately localize the diverse action labels in both the spatial and temporal domains. Extensive evaluation and empirical analysis show that TranSTL achieves superior performance over the state of the art on two challenging benchmarks, Charades and Multi-Thumos.
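The two attention steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`attention`, `stld_like_layer`), the tensor sizes, the random inputs, and the linear classifier are all assumptions made for illustration, and the sketch omits the learned projections, multiple heads, normalization, and feed-forward sublayers that a real Transformer layer would normally include.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention; returns the output and the weights.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def stld_like_layer(label_emb, video_tokens):
    # Hypothetical stand-in for an STLD-style layer (assumption, not the paper's code).
    # Step 1: self-attention among label embeddings (label co-occurrence).
    co_occurrence, _ = attention(label_emb, label_emb, label_emb)
    label_emb = label_emb + co_occurrence  # residual connection
    # Step 2: cross-attention from labels to spatial-temporal video tokens.
    attended, weights = attention(label_emb, video_tokens, video_tokens)
    return label_emb + attended, weights

rng = np.random.default_rng(0)
num_labels, frames, patches, dim = 5, 4, 9, 16  # illustrative sizes only

label_emb = rng.normal(size=(num_labels, dim))            # one query per label
video_tokens = rng.normal(size=(frames * patches, dim))   # flattened T x HW tokens

updated, weights = stld_like_layer(label_emb, video_tokens)

# Per-label attention map over (frame, patch): a rough localization signal.
localization = weights.reshape(num_labels, frames, patches)

# Hypothetical classifier: one sigmoid score per label (multi-label output).
w = rng.normal(size=(dim,))
probs = 1.0 / (1.0 + np.exp(-(updated @ w)))
```

In this sketch each label acts as a query, so the cross-attention weights give one distribution over frames and patches per label, which is one plausible reading of "spatial-temporal localization" in the abstract.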