22 Sep 2021

In this work, we propose a deep audio-visual fusion model to estimate the saliency of videos. The model extracts visual and audio features with two separate branches and fuses them to generate the saliency map. We design a novel temporal attention module to exploit temporal information and a spatial feature pyramid module to fuse spatial information, and a multi-scale audio-visual fusion method then integrates the two modalities. Furthermore, we propose a new dataset for audio-visual saliency estimation, consisting of 202 high-quality video sequences covering a wide range of motions, scenes, and object types; many of the videos exhibit strong audio-visual correspondence. Experiments on several datasets demonstrate that our model outperforms previous state-of-the-art methods by a large margin, and that the proposed dataset can serve as a new benchmark for the audio-visual saliency estimation task.
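The abstract describes the two-branch, fuse-then-decode design only at a high level. As a rough illustration of that pattern, here is a minimal PyTorch sketch: all module names (`TemporalAttention`, `AudioVisualSaliency`), layer sizes, the frame-level attention scoring, and the concatenation-based fusion are assumptions made for illustration, not the authors' implementation, and the spatial feature pyramid and multi-scale fusion stages from the paper are omitted for brevity.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Hypothetical temporal attention: scores each frame's features
    and pools over time with softmax-normalized weights."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # (B*T, C, 1, 1)
            nn.Flatten(),              # (B*T, C)
            nn.Linear(channels, 1),    # one scalar score per frame
        )

    def forward(self, x):
        # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        scores = self.score(x.reshape(b * t, c, h, w)).reshape(b, t, 1, 1, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * x).sum(dim=1)  # (B, C, H, W), temporally pooled

class AudioVisualSaliency(nn.Module):
    """Two-branch sketch: frames and an audio spectrogram are encoded
    separately, fused by concatenation, then decoded to a saliency map."""
    def __init__(self, vis_ch=64, aud_ch=64):
        super().__init__()
        self.visual = nn.Sequential(   # per-frame 2D encoder (stand-in)
            nn.Conv2d(3, vis_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(vis_ch, vis_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.temporal = TemporalAttention(vis_ch)
        self.audio = nn.Sequential(    # spectrogram encoder (stand-in)
            nn.Conv2d(1, aud_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # (B, aud_ch, 1, 1) clip embedding
        )
        self.decoder = nn.Sequential(  # fuse and upsample to a saliency map
            nn.Conv2d(vis_ch + aud_ch, vis_ch, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
            nn.Conv2d(vis_ch, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, frames, spec):
        # frames: (B, T, 3, H, W); spec: (B, 1, freq_bins, time_steps)
        b, t, c, h, w = frames.shape
        v = self.visual(frames.reshape(b * t, c, h, w))
        v = self.temporal(v.reshape(b, t, *v.shape[1:]))    # (B, C, H/4, W/4)
        a = self.audio(spec).expand(-1, -1, *v.shape[-2:])  # broadcast audio
        return self.decoder(torch.cat([v, a], dim=1))       # (B, 1, H, W)

frames = torch.randn(2, 8, 3, 64, 64)   # batch of 8-frame clips
spec = torch.randn(2, 1, 128, 32)       # log-mel spectrograms
print(AudioVisualSaliency()(frames, spec).shape)  # torch.Size([2, 1, 64, 64])
```

The structural point of the sketch is that the audio embedding is broadcast over the visual feature grid before fusion, so the decoder can weigh both modalities at every spatial location when predicting the saliency map.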
