Rule-Embedded Network For Audio-Visual Voice Activity Detection In Live Musical Video Streams

Yuanbo Hou, Yi Deng, Bilei Zhu, Zejun Ma, Dick Botteldooren

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:13:07

08 Jun 2021

Detecting anchor’s voice in live musical streams is an important preprocessing step for music and speech signal processing. Existing approaches to voice activity detection (VAD) primarily rely on audio, however, audio-based VAD is difficult to effectively focus on the target voice in noisy environments. This paper proposes a rule-embedded network to fuse the audio-visual (A-V) inputs for better detection of the target voice. The core role of the rule in the model is to coordinate the relation between the bi-modal information and use visual representations as a mask to filter out the information of non-target sound. Experiments show that: 1) with the help of cross-modal fusion using the proposed rule, the detection results of the A-V branch outperform that of the audio branch in the same model framework; 2) the performance of the bimodal A-V model far outperforms that of audio-only models, indicating that the incorporation of both audio and visual signals is highly beneficial for VAD. To attract more attention to the cross-modal music and audio signal processing, a new live musical video corpus with frame-level labels is introduced.

Chairs:

Dong Tian

Tags:

signal processing society

IEEE icassp 2021

virtual conference

2021

sps

virtual conference icassp 2021

june 6-11 2021

icassp 2021