CM-CS: CROSS-MODAL COMMON-SPECIFIC FEATURE LEARNING FOR AUDIO-VISUAL VIDEO PARSING

Hongbo Chen (ShanghaiTech University); Dongchen Zhu (SIMIT); Guanghui Zhang (SIMIT); Wenjun Shi (SIMIT); Xiaolin Zhang (SIMIT); Jiamao Li (SIMIT)

07 Jun 2023

The weakly-supervised audio-visual video parsing (AVVP) task aims to parse the duration and categories of each snippet when only video-level event labels are provided. Most methods either leverage attention mechanisms to explore cross-modal and cross-video event semantics or alleviate label noise to improve performance. However, the distributional modality discrepancy caused by the heterogeneity of the signals remains a significant challenge. To this end, we propose a novel cross-modal common-specific feature learning method (cm-CS) that maps the modal features into modality-common and modality-specific subspaces. The former aims to capture similar high-level scene cues across different modalities, while the latter attempts to capture modality-specific cues. The proposed method is applied both within the visual modality (between 2D and 3D features) and across the audio and visual modalities. In addition, we design a training strategy to strengthen the learning of similarities and differences across modalities. Experiments on the Look, Listen, and Parse (LLP) dataset show a large improvement of our method over existing works (e.g., from 58.9% to 62.9% on the video-level visual metric).
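The common/specific split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the network layout or the loss functions, so the linear projections, the cosine-based similarity and difference terms, the feature sizes, and every name below (`project`, `common_specific_losses`, the `params` keys) are assumptions chosen only to show the idea of pulling common embeddings together while keeping specific embeddings distinct.

```python
import math
import random


def project(weights, x):
    """Apply a linear map (given as a list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + 1e-12)


def random_matrix(rows, cols, rng):
    """Random projection standing in for a learned layer (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def common_specific_losses(audio, visual, params):
    """Return (similarity_loss, difference_loss) for one audio/visual pair."""
    # Hypothetical layout: each modality has one common and one specific map.
    a_com = project(params["audio_common"], audio)
    a_spe = project(params["audio_specific"], audio)
    v_com = project(params["visual_common"], visual)
    v_spe = project(params["visual_specific"], visual)
    # Similarity term: small when the two common embeddings point the same way.
    sim_loss = 1.0 - cosine(a_com, v_com)
    # Difference term: small when each specific embedding is orthogonal
    # to the common embedding of the same modality.
    diff_loss = cosine(a_com, a_spe) ** 2 + cosine(v_com, v_spe) ** 2
    return sim_loss, diff_loss


rng = random.Random(0)
audio_dim, visual_dim, sub_dim = 8, 12, 4  # arbitrary illustrative sizes
params = {
    "audio_common": random_matrix(sub_dim, audio_dim, rng),
    "audio_specific": random_matrix(sub_dim, audio_dim, rng),
    "visual_common": random_matrix(sub_dim, visual_dim, rng),
    "visual_specific": random_matrix(sub_dim, visual_dim, rng),
}
audio = [rng.gauss(0.0, 1.0) for _ in range(audio_dim)]
visual = [rng.gauss(0.0, 1.0) for _ in range(visual_dim)]
sim_loss, diff_loss = common_specific_losses(audio, visual, params)
```

In a trainable version the projections would be learned and the two terms would be weighted and added to the parsing objective; the same pairing could in principle be applied to 2D and 3D visual features instead of audio and visual ones, again as an assumption rather than a description of the paper's design.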

Tags:

Multimedia analysis and synthesis

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
