A Robust Audio-Visual Speech Enhancement Model
Wupeng Wang, Chao Xing, Dong Wang, Xiao Chen, Fengyu Sun
Most existing audio-visual speech enhancement (AVSE) methods work well in conditions with strong noise; however, when applied to conditions with a medium SNR, serious performance degradation is often observed. This degradation can be partly attributed to the feature-fusion architecture (e.g., early fusion) that tightly couples the very strong audio information with the relatively weak visual information. In this paper, we present a safe AVSE approach that lets the visual stream contribute to audio speech enhancement (ASE) safely under various SNR conditions by late fusion. The key novelty is two-fold. Firstly, we define power binary masks (PBMs) as a rough representation of speech signals; this rough representation admits the weakness of the visual information and so can be easily predicted from the visual stream. Secondly, we design a posterior augmentation architecture that integrates the visual-derived PBMs into the audio-derived masks via a gating network. With this architecture, the overall performance is lower-bounded by the audio-based component. Our experiments on the Grid dataset demonstrate that this new approach consistently outperforms the audio-based system in all noise conditions, confirming that it is a safe way to incorporate visual knowledge into speech enhancement.
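To make the late-fusion idea more concrete, below is a minimal PyTorch sketch of a gating network that merges an audio-derived soft mask with a visual-derived PBM. The module name, tensor shapes, and the convex-combination gating formula are illustrative assumptions, not the paper's exact architecture; they are meant only to show how a learned gate can let the output fall back to the audio-only mask when the visual cue is unreliable.

```python
# Hypothetical sketch of gated late fusion between an audio-derived soft mask
# and a visual-derived power binary mask (PBM). Names, shapes, and the gating
# formula are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn


class GatedMaskFusion(nn.Module):
    """Combine an audio-derived mask with a visual-derived PBM via a learned gate."""

    def __init__(self, n_freq: int):
        super().__init__()
        # The gate sees both masks and outputs a per-bin weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(2 * n_freq, n_freq),
            nn.Sigmoid(),
        )

    def forward(self, audio_mask: torch.Tensor, visual_pbm: torch.Tensor) -> torch.Tensor:
        # audio_mask, visual_pbm: (batch, time, n_freq)
        g = self.gate(torch.cat([audio_mask, visual_pbm], dim=-1))
        # Convex combination: as g -> 0 the audio-only mask is recovered,
        # which is one way to keep audio-based performance as a lower bound.
        return (1.0 - g) * audio_mask + g * visual_pbm


if __name__ == "__main__":
    fusion = GatedMaskFusion(n_freq=257)
    audio_mask = torch.rand(1, 100, 257)                    # soft mask from the audio branch
    visual_pbm = (torch.rand(1, 100, 257) > 0.5).float()    # binary mask from the visual branch
    enhanced_mask = fusion(audio_mask, visual_pbm)
    print(enhanced_mask.shape)  # torch.Size([1, 100, 257])
```

In this sketch the gate is the only component that touches the visual stream, so a poorly predicted PBM can simply be gated out, mirroring the abstract's claim that the audio-based component lower-bounds overall performance.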