CM-CS : CROSS-MODAL COMMON-SPECIFIC FEATURE LEARNING FOR AUDIO-VISUAL VIDEO PARSING
Hongbo Chen (ShanghaiTech University); Dongchen Zhu (SIMIT); Guanghui Zhang (SIMIT); Wenjun Shi (SIMIT); Xiaolin Zhang (SIMIT); Jiamao Li (SIMIT)
The weakly-supervised audio-visual video parsing (AVVP) task aims to parse the temporal extent and event categories of each snippet when only video-level event labels are provided. Most methods either leverage attention mechanisms to explore cross-modal and cross-video event semantics or alleviate label noise to improve performance. However, the distributional discrepancy between modalities, caused by the heterogeneity of the signals, remains a significant challenge. To this end, we propose a novel cross-modal common-specific feature learning method (cm-CS) that maps modal features into modality-common and modality-specific subspaces. The former captures similar high-level scene cues across different modalities, while the latter captures modality-specific cues. The proposed method is applied both within the visual modality (between 2D and 3D features) and across the audio and visual modalities. In addition, we design a training strategy that strengthens the learning of similarities and differences across modalities. Experiments show that our method substantially outperforms existing works on the Look, Listen, and Parse (LLP) dataset (\textit{e.g.}, improving the video-level visual metric from $58.9\%$ to $62.9\%$).
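To make the common-specific decomposition concrete, below is a minimal sketch of the core idea, assuming PyTorch. The module and function names (CommonSpecificHeads, cs_losses), the linear projection heads, the cosine-based similarity and difference losses, and the loss weighting are illustrative assumptions for exposition, not the paper's actual architecture or training objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpecificHeads(nn.Module):
    """Projects one modality's features into a modality-common and a
    modality-specific subspace (hypothetical sketch; layer sizes assumed)."""
    def __init__(self, dim=512):
        super().__init__()
        self.common = nn.Linear(dim, dim)    # modality-common projection
        self.specific = nn.Linear(dim, dim)  # modality-specific projection

    def forward(self, x):
        return self.common(x), self.specific(x)

def cs_losses(c_a, s_a, c_v, s_v):
    """Pull the common features of two modalities together and push each
    modality's specific features away from its own common features."""
    # similarity term: common audio and visual features should align
    sim = 1.0 - F.cosine_similarity(c_a, c_v, dim=-1).mean()
    # difference term: specific features should be near-orthogonal to common ones
    diff = (F.cosine_similarity(c_a, s_a, dim=-1).pow(2).mean()
            + F.cosine_similarity(c_v, s_v, dim=-1).pow(2).mean())
    return sim, diff

# Toy usage on snippet-level features of shape (batch, snippets, dim);
# the same scheme could pair 2D and 3D visual features within the visual modality.
heads_a, heads_v = CommonSpecificHeads(), CommonSpecificHeads()
feat_a, feat_v = torch.randn(2, 10, 512), torch.randn(2, 10, 512)
c_a, s_a = heads_a(feat_a)
c_v, s_v = heads_v(feat_v)
sim, diff = cs_losses(c_a, s_a, c_v, s_v)
loss = sim + 0.1 * diff  # relative weighting is an assumption
```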