AV(SE)²: Audio-Visual Squeeze-Excite Speech Enhancement
Michael Iuzzolino, Kazuhito Koishida
SPS
Length: 14:57
The goal of audio-visual speech enhancement (AVSE) is to supplement audio-only information with visual information, such as the target speaker's lip movements, to improve the intelligibility and overall perceptual quality of noisy speech signals. We propose AV(SE)², a new mechanism for audio-visual (AV) fusion that leverages a cross-modal squeeze-excitation (SE) block for speech enhancement. The fusion block can be attached to any feature layer of the audio and visual networks, and it uses significantly fewer model parameters than the standard AV fusion method of channel-wise concatenation, without loss of performance. We show that AV(SE)² with time-based gating across multiple feature layers outperforms baseline methods that use single-point, channel-wise concatenated AV fusion on objective evaluations.
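To make the idea of a cross-modal squeeze-excitation block more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common SE recipe (average-pool, bottleneck with ReLU, sigmoid gates) and applies gates computed from visual features to audio feature channels. The function name `cross_modal_se`, the weight matrices `W1` and `W2`, and all sizes are hypothetical; the abstract does not specify layer shapes, the gating direction, or how the time-based variant is built.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cross_modal_se(audio, visual, W1, W2):
    """Gate audio channels with a descriptor squeezed from visual features.

    audio:  C_a x T feature map (list of channels, each a list over time)
    visual: C_v x T feature map
    W1:     H x C_v bottleneck weights (hypothetical)
    W2:     C_a x H expansion weights (hypothetical)
    """
    # Squeeze: average each visual channel over time.
    squeezed = [sum(ch) / len(ch) for ch in visual]
    # Excite: bottleneck with ReLU, then one sigmoid gate per audio channel.
    hidden = [max(0.0, h) for h in matvec(W1, squeezed)]
    gates = [sigmoid(g) for g in matvec(W2, hidden)]
    # Scale: multiply every audio channel by its gate.
    gated = [[g * a for a in ch] for g, ch in zip(gates, audio)]
    return gated, gates


# Toy sizes chosen for illustration only.
random.seed(0)
C_A, C_V, T, H = 4, 3, 5, 2
audio = [[random.gauss(0, 1) for _ in range(T)] for _ in range(C_A)]
visual = [[random.gauss(0, 1) for _ in range(T)] for _ in range(C_V)]
W1 = [[random.gauss(0, 1) for _ in range(C_V)] for _ in range(H)]
W2 = [[random.gauss(0, 1) for _ in range(H)] for _ in range(C_A)]

gated, gates = cross_modal_se(audio, visual, W1, W2)
```

In this sketch the fusion weights number H·C_v + C_a·H, and the audio feature map keeps its C_a channels. Concatenating the two feature maps would instead widen the input to C_a + C_v channels for whatever layer follows. A time-based variant would presumably skip the temporal average and compute gates per time step, but that detail is an assumption here.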