Incorporating lip features into audio-visual multi-speaker DOA estimation by gated fusion

Ya Jiang (University of Science and Technology of China); Hang Chen (USTC); Jun Du (University of Science and Technology of China); Qing Wang (University of Science and Technology of China); Chin-Hui Lee (Georgia Institute of Technology)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
08 Jun 2023

Audio-visual direction of arrival (DOA) estimation has recently demonstrated superior performance. In this paper, we present a novel audio-visual multi-speaker DOA estimation network which, for the first time, incorporates multi-speaker lip features to adapt to complex overlapping and noisy scenarios. First, we separately encode the multi-channel audio features, the reference angles, and the lip Regions of Interest (RoIs) detected from the video to acquire high-level representations. The multi-modal embeddings of audio, speaker angles, and lips are then fused by a tri-modal gated fusion module that balances their contributions to the output. The fused embedding is fed to the backend network, which produces an accurate DOA estimate by combining the predicted speaker angular vectors with the speaker activities. Experimental results show that our proposed approach reduces the localization error by 73.48% compared to previous work on the 2021 Multi-modal Information based Speech Processing (MISP) challenge AVSR corpus. Meanwhile, the high accuracy and stability of the localization results demonstrate the proposed model's robustness in multi-speaker scenarios.
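The tri-modal gated fusion described in the abstract can be illustrated with a minimal sketch: each modality embedding (audio, speaker angle, lip) passes through its own sigmoid gate, and the gated embeddings are summed so the model can up- or down-weight each modality. This is a hypothetical, simplified parameterization (random placeholder weights, plain numpy), not the paper's actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TriModalGatedFusion:
    """Minimal sketch of tri-modal gated fusion (hypothetical design):
    one learned linear gate per modality; gated embeddings are summed
    to balance each modality's contribution to the fused output."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Placeholder (weight, bias) per modality; in a real network
        # these would be trained jointly with the rest of the model.
        self.gates = [(0.1 * rng.standard_normal((dim, dim)), np.zeros(dim))
                      for _ in range(3)]

    def __call__(self, audio_emb, angle_emb, lip_emb):
        fused = np.zeros_like(audio_emb)
        for (w, b), emb in zip(self.gates, (audio_emb, angle_emb, lip_emb)):
            gate = sigmoid(emb @ w + b)  # per-dimension gate in (0, 1)
            fused += gate * emb          # suppress or pass each modality
        return fused

fusion = TriModalGatedFusion(dim=8)
out = fusion(np.ones(8), np.ones(8), np.ones(8))
```

A gating scheme like this lets the network rely more on lip features when the audio is noisy or overlapped, and more on audio when the video is unreliable.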
