TranSTL: Spatial-Temporal Localization Transformer for Multi-Label Video Classification

Hongjun Wu, Mengzhu Li, Hongzhe Liu, Cheng Xu, Xuewei Li, Yongcheng Liu

Length: 00:07:18

09 May 2022

Multi-label video classification (MLVC) is a long-standing and challenging research problem in video signal analysis. Real-world videos generally contain many complex action labels, and these actions have inherent dependencies in both the spatial and temporal domains. Motivated by this observation, we propose TranSTL, a spatial-temporal localization Transformer framework for MLVC tasks. In addition to leveraging global action label co-occurrence, we propose a novel plug-and-play Spatial Temporal Label Dependency (STLD) layer in TranSTL. STLD not only dynamically models the label co-occurrence in a video through a self-attention mechanism but also captures spatial-temporal action dependencies using a cross-attention strategy. As a result, TranSTL is able to explicitly and accurately capture the diverse action labels in both the spatial and temporal domains. Extensive evaluation and empirical analysis show that TranSTL outperforms state-of-the-art methods on two challenging benchmarks, Charades and Multi-Thumos.
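The abstract gives no implementation details for STLD, so the following is only a minimal sketch of the general pattern it describes: self-attention among label embeddings, followed by cross-attention from labels to spatial-temporal video features. Everything here is an assumption made for illustration, including the names `stld_sketch` and `multilabel_scores`, the residual connections, the absence of learned projections, the toy dimensions, and the sigmoid scoring. It is not the authors' architecture.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (assumption)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped lists of vectors (residual)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def stld_sketch(label_embeddings, video_tokens):
    """Hypothetical STLD-like layer: one state vector per action label."""
    # Self-attention: every label attends to all labels (co-occurrence).
    labels = add(label_embeddings,
                 attention(label_embeddings, label_embeddings, label_embeddings))
    # Cross-attention: every label queries the flattened T*H*W video tokens.
    labels = add(labels, attention(labels, video_tokens, video_tokens))
    return labels


def multilabel_scores(label_states, classifier_weights):
    """Independent sigmoid score per label (assumed multi-label head)."""
    return [1.0 / (1.0 + math.exp(-sum(s * w for s, w in zip(state, weight))))
            for state, weight in zip(label_states, classifier_weights)]


# Toy inputs: random values stand in for learned embeddings and features.
random.seed(0)
num_labels, dim = 4, 8
frames, height, width = 3, 2, 2
label_embeddings = [[random.gauss(0, 1) for _ in range(dim)]
                    for _ in range(num_labels)]
video_tokens = [[random.gauss(0, 1) for _ in range(dim)]
                for _ in range(frames * height * width)]
classifier_weights = [[random.gauss(0, 1) for _ in range(dim)]
                      for _ in range(num_labels)]

label_states = stld_sketch(label_embeddings, video_tokens)
scores = multilabel_scores(label_states, classifier_weights)
```

In this sketch each label ends up with its own state vector and its own independent score, which is what distinguishes a multi-label head from a single softmax over classes.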

Tags:

spatial temporal label dependency

transformer

label co-occurrence dependency

multi-label video classification

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
