SELF-SUPERVISED CONTRASTIVE LEARNING FOR AUDIO-VISUAL ACTION RECOGNITION
Yang Liu, Ying Tan, Haoyuan Lan
The underlying correlation between the audio and visual modalities can be exploited to derive supervisory signals for unlabeled videos. In this paper, we propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL) to learn discriminative audio-visual representations for action recognition. Specifically, we design an attention-based multi-modal fusion module (AMFM) to fuse the audio and visual modalities. To align the heterogeneous audio-visual modalities, we construct a novel co-correlation guided representation alignment module (CGRA). To learn supervisory signals from unlabeled videos, we propose a novel self-supervised contrastive learning module (SelfCL). Furthermore, we build a new audio-visual action recognition dataset named Kinetics-Sounds100. Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets demonstrate the superiority of our AVCL over state-of-the-art methods on large-scale action recognition benchmarks.
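To make the cross-modal contrastive objective concrete, the sketch below shows a generic symmetric InfoNCE loss over paired audio and visual clip embeddings; it is an illustrative assumption rather than the exact AVCL/SelfCL formulation, and the function name, embedding dimension, and temperature value are placeholders.

```python
# Minimal sketch of a symmetric audio-visual InfoNCE loss (illustrative only;
# not the paper's exact AVCL/SelfCL objective). Clip-aligned audio/visual
# embeddings are treated as positive pairs, all other pairs in the batch as negatives.
import torch
import torch.nn.functional as F


def audio_visual_info_nce(audio_emb: torch.Tensor,
                          visual_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, visual_emb: (batch, dim) embeddings from the two streams."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                  # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2v = F.cross_entropy(logits, targets)       # audio queries vs. visual keys
    loss_v2a = F.cross_entropy(logits.t(), targets)   # visual queries vs. audio keys
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    # Toy usage: random tensors standing in for encoder outputs.
    audio = torch.randn(8, 256)
    visual = torch.randn(8, 256)
    print(audio_visual_info_nce(audio, visual).item())
```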