Collaborative Audio-Visual Event Localization based on Sequential Decision and Cross-modal Consistency

Yuqian Kuang (Harbin Institute of Technology); Xiaopeng Fan (Harbin Institute of Technology)

07 Jun 2023

We address audio-visual event (AVE) localization: locating the video segments that contain an AVE and identifying their event categories. Different event-relevant segments often describe different aspects of the same AVE, so they can complement one another. However, current approaches model AVE localization as a sequential classification process, in which event-relevant segments cannot collaborate effectively. We therefore propose Collaborative Segments Decision (CSD), which models AVE localization as a sequential decision process so that event-relevant segments can collaborate. To enable collaboration between cross-modal features as well, we propose Consistent Feature Propagation (CFP), which exploits the consistency of these features over time. Combining the two components yields the Collaborative Decision Network (CDN). Experimental results show that CDN outperforms baseline methods in both fully and weakly supervised settings.
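
To make the task output concrete, the following minimal sketch shows how per-segment category predictions could be turned into localized event spans. It illustrates the task definition only and is not the CSD, CFP, or CDN method. The `background` label, the one-second segment length, the function name `localize_events`, and the example categories are all assumptions introduced for illustration.

```python
from itertools import groupby

# Hypothetical label for segments that contain no audio-visual event.
BACKGROUND = "background"


def localize_events(segment_labels, segment_seconds=1.0):
    """Merge runs of identical non-background labels into (category, start, end) spans.

    segment_labels: one predicted category per fixed-length video segment.
    segment_seconds: assumed duration of each segment, in seconds.
    """
    spans = []
    index = 0  # index of the first segment in the current run
    for label, run in groupby(segment_labels):
        length = len(list(run))
        if label != BACKGROUND:
            start = index * segment_seconds
            end = (index + length) * segment_seconds
            spans.append((label, start, end))
        index += length
    return spans


# Made-up predictions for a ten-segment video.
labels = [
    "background", "dog_bark", "dog_bark", "dog_bark", "background",
    "background", "church_bell", "church_bell", "background", "background",
]
print(localize_events(labels))
# [('dog_bark', 1.0, 4.0), ('church_bell', 6.0, 8.0)]
```

Any model that emits one label per segment, whether it frames the problem as sequential classification or as sequential decision, could feed a post-processing step of this kind.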

Tags:

Image and video representation