A Robust Audio-Visual Speech Enhancement Model

Wupeng Wang, Chao Xing, Dong Wang, Xiao Chen, Fengyu Sun

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 11:21
04 May 2020

Most existing audio-visual speech enhancement (AVSE) methods work well in conditions with strong noise; however, when applied to conditions with a medium SNR, serious performance degradation is often observed. This degradation can be partly attributed to the feature-fusion architecture (e.g., early fusion) that tightly couples the very strong audio information with the relatively weak visual information. In this paper, we present a safe AVSE approach based on late fusion that allows the visual stream to contribute safely to audio speech enhancement (ASE) across a wide range of SNRs. The key novelty is two-fold. First, we define power binary masks (PBMs) as a rough representation of speech signals; this rough representation acknowledges the weakness of the visual information and so can be easily predicted from the visual stream. Second, we design a posterior augmentation architecture that integrates the visual-derived PBMs into the audio-derived masks via a gating network, so that the overall performance is lower-bounded by the audio-based component. Our experiments on the Grid dataset demonstrate that this new approach consistently outperforms the audio-based system in all noise conditions, confirming that it is a safe way to incorporate visual knowledge into speech enhancement.
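To illustrate the two ideas in the abstract, the sketch below shows one plausible form of a power binary mask and of the gated late fusion. The function names, the PBM threshold, and the exact fusion rule are illustrative assumptions, not the paper's formulation; the point is that when the gate closes (g → 0), the audio-only mask is recovered, which is what lower-bounds performance by the audio-based component.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def power_binary_mask(clean_power, noisy_power, threshold=0.5):
    # Illustrative PBM: 1 in time-frequency bins where the target
    # speech dominates the noisy mixture's power, 0 elsewhere.
    # Such a coarse binary target is easier to predict from the
    # weak visual stream than a fine-grained ratio mask.
    ratio = clean_power / np.maximum(noisy_power, 1e-12)
    return (ratio > threshold).astype(np.float32)

def gated_late_fusion(audio_mask, visual_pbm, gate_logits):
    # Assumed posterior-augmentation rule: a per-bin gate g in [0, 1]
    # blends the audio-derived mask with the visual-derived PBM.
    # With g = 0 the output equals the audio-only mask, so the fused
    # system can never fall below the audio-based component.
    g = sigmoid(gate_logits)
    return (1.0 - g) * audio_mask + g * visual_pbm
```

In a learned system the gate logits would come from a small network conditioned on both streams; here they are plain inputs so the limiting behaviour is easy to check.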
