Implicit Attention-based Cross-modal Collaborative Learning for Action Recognition

Jianghao Zhang, Xian Zhong, Wenxuan Liu, Kui Jiang, Zhengwei Yang, Zheng Wang

Poster 10 Oct 2023

Human action recognition has been an active research topic in recent years. Multiple modalities often convey heterogeneous but potentially complementary action information that a single modality does not hold. Some efforts have explored cross-modal representations to improve modeling capability, but with limited gains due to the simple fusion of different modalities. To this end, we propose impliCit attention-based Cross-modal Collaborative Learning (C3L) for action recognition. Specifically, we apply a Modality Generalization network with Grayscale enhancement (MGG) to learn modality-specific representations and interactions (infrared and RGB). Then, we construct a unified representation space through the Uniform Modality Representation (UMR) module, which preserves modality information while enhancing the overall representation ability. Finally, the feature extractors adaptively leverage modality-specific knowledge to realize cross-modal collaborative learning. Extensive experiments conducted on three widely used public benchmarks, InfAR, HMDB51, and UCF101, demonstrate the effectiveness of our proposed method.
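To make the described pipeline concrete, below is a minimal PyTorch sketch of the abstract's structure: two modality-specific encoders (RGB with a grayscale-enhanced input, and infrared), a shared projection standing in for the UMR module, and an implicit attention weighting over modality features for collaborative fusion. All names and design choices here (grayscale_enhance, ModalityEncoder, C3LSketch, the attention layout) are illustrative assumptions; the paper's actual MGG and UMR architectures are not specified in this abstract.

```python
# Hypothetical sketch of the C3L-style pipeline; not the authors' implementation.
import torch
import torch.nn as nn


def grayscale_enhance(rgb):
    # Grayscale enhancement (assumption): append a luminance channel so the
    # RGB branch also sees a modality-agnostic view, as MGG's name suggests.
    gray = 0.299 * rgb[:, 0:1] + 0.587 * rgb[:, 1:2] + 0.114 * rgb[:, 2:3]
    return torch.cat([rgb, gray], dim=1)  # (B, 4, H, W)


class ModalityEncoder(nn.Module):
    """Modality-specific feature extractor (stand-in for an MGG branch)."""
    def __init__(self, in_ch, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.net(x)  # (B, dim)


class C3LSketch(nn.Module):
    """Unified space (UMR stand-in) + implicit attention over modalities."""
    def __init__(self, dim=128, num_classes=51):
        super().__init__()
        self.rgb_enc = ModalityEncoder(in_ch=4, dim=dim)  # RGB + grayscale channel
        self.ir_enc = ModalityEncoder(in_ch=1, dim=dim)   # infrared
        self.unify = nn.Linear(dim, dim)                  # shared projection
        self.attn = nn.Linear(dim, 1)                     # implicit attention scores
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb, ir):
        feats = torch.stack([
            self.unify(self.rgb_enc(grayscale_enhance(rgb))),
            self.unify(self.ir_enc(ir)),
        ], dim=1)                                   # (B, 2, dim)
        w = torch.softmax(self.attn(feats), dim=1)  # per-modality weights
        fused = (w * feats).sum(dim=1)              # collaborative fusion
        return self.head(fused)


# Usage with dummy tensors: a batch of RGB frames and aligned infrared frames.
model = C3LSketch()
logits = model(torch.randn(2, 3, 112, 112), torch.randn(2, 1, 112, 112))
print(logits.shape)  # torch.Size([2, 51])
```

The softmax over the two projected modality features is one plausible reading of "implicit attention": the fusion weights are learned from the features themselves rather than fixed, letting each extractor adaptively draw on modality-specific knowledge.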
