
A MULTI-SCALE FEATURE AGGREGATION BASED LIGHTWEIGHT NETWORK FOR AUDIO-VISUAL SPEECH ENHANCEMENT

Haitao Xu (University of Science and Technology of China); Liangfa Wei (Tencent); Jie Zhang (University of Science and Technology of China); Jianming Yang (Tsinghua University); Yannan Wang (Tencent); Tian Gao (University of Science and Technology of China); Xin Fang (iFlytek Research); Lirong Dai (University of Science and Technology of China)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
06 Jun 2023

Audio-visual speech enhancement (AVSE) has been shown to outperform its conventional audio-only counterpart in improving speech quality. However, most existing AVSE models are heavyweight in terms of parameter count, which makes them ill-suited to deployment in practical applications. In this paper, we therefore present a lightweight AVSE approach, called M3Net, that incorporates several multi-modality, multi-scale and multi-branch strategies. Three multi-scale techniques are designed for the visual and audio streams: multi-scale average pooling (MSAP), multi-scale ResNet (MSResNet) and multi-scale short-time Fourier transform (MSSTFT). Each multi-scale module is shown to contribute positively to performance. We also consider four skip connections for audio-visual feature aggregation, which strongly complement the designed multi-scale techniques. Experimental results show that these techniques can be flexibly combined with existing approaches and, more importantly, achieve performance comparable to that of heavyweight networks with a smaller model size.
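
To give a rough feel for what a multi-scale short-time Fourier transform involves, the sketch below computes magnitude spectrograms of one signal at several window lengths using NumPy. It illustrates the general idea only and is not the M3Net implementation: the abstract does not give the window lengths, hop sizes, or how the resolutions are combined, so the function names `stft_magnitude` and `multi_scale_stft`, the window lengths (256, 512, 1024), the half-window hop, and the synthetic 440 Hz test tone are all assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(signal, win_length, hop_length):
    """Magnitude spectrogram with a Hann window.

    Returns an array of shape (frames, win_length // 2 + 1).
    Trailing samples that do not fill a whole frame are dropped.
    """
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

def multi_scale_stft(signal, win_lengths=(256, 512, 1024)):
    """One magnitude spectrogram per window length (assumed hop = window // 2).

    Short windows give finer time resolution, long windows finer
    frequency resolution; the window lengths here are illustrative only.
    """
    return {w: stft_magnitude(signal, w, w // 2) for w in win_lengths}

# One second of a 440 Hz tone at 16 kHz, standing in for a speech signal.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

specs = multi_scale_stft(tone)
shapes = {w: s.shape for w, s in specs.items()}
for w, shape in shapes.items():
    print(f"window {w}: {shape[0]} frames x {shape[1]} frequency bins")
```

Each scale yields a different time-frequency grid, so a model that consumes all of them needs some way to align or fuse the results; that design choice is left open here because the abstract does not describe it.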
