NAVIGATING AUDIO-VISUAL EVENT DETECTION ACROSS MISMATCHED MODALITIES

Guangwei Li, Xuenan Xu, Mengyue Wu, Kai Yu

Length: 00:12:45

09 May 2022

Previous work on audio-visual (AV) alignment has mainly focused on frame-level synchronization while neglecting clip-wise matching. We focus on AV parsing of fully unconstrained data, where audio and visual events do not necessarily co-occur. We provide a video-enhanced AudioSet dataset covering 376 events to investigate parsing in this mismatched setting. To our knowledge, this is the first time that AV event parsing and detection have been examined in a clip-wise matching scenario. Experiments show that our proposed method substantially improves video parsing accuracy on both tagging and detection. Further, a parsing model pretrained on our dataset can help accurately locate the time spans in which audio and video are synchronized.

Tags:

clip-level mismatch

multimodal

audio-visual event detection

weakly-supervised

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
