A MULTI-SCALE FEATURE AGGREGATION BASED LIGHTWEIGHT NETWORK FOR AUDIO-VISUAL SPEECH ENHANCEMENT
Haitao Xu ( University of Science and Technology of China); Liangfa Wei (Tencent); Jie Zhang (University of Science and Technology of China); Jianming Yang (Tsinghua University); Yannan Wang (Tencent); Tian Gao (University of Science and Technology of China); Xin Fang (iFlytek Research); Lirong Dai (University of Science and Technology of China)
Audio-visual speech enhancement (AVSE) has been shown to outperform its conventional audio-only counterpart in improving speech quality. However, most existing AVSE models are heavyweight in terms of parameter count, which hinders deployment in practical applications. In this paper, we therefore present a lightweight AVSE approach (called M3Net) that incorporates multi-modality, multi-scale and multi-branch strategies. Three multi-scale techniques are designed for the visual and audio streams: multi-scale average pooling (MSAP), multi-scale ResNet (MSResNet) and multi-scale short-time Fourier transform (MSSTFT). Each multi-scale module is shown to contribute positively to the performance. We also introduce four skip connections for audio-visual feature aggregation, which strongly complement the designed multi-scale techniques. Experimental results show that these techniques can be flexibly combined with existing approaches and, more importantly, achieve performance comparable to heavyweight networks with a much smaller model size.
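To illustrate the multi-scale idea behind modules such as MSAP, the following is a minimal sketch (not the paper's actual implementation, whose architecture and window sizes are not specified here): a 1-D feature sequence is average-pooled at several window sizes, and the pooled views are concatenated so downstream layers see context at multiple temporal resolutions. The function names and the window sizes `(1, 2, 4)` are illustrative assumptions.

```python
# Illustrative multi-scale average pooling sketch (hypothetical names/scales,
# not the M3Net implementation).
def avg_pool_1d(seq, window):
    """Non-overlapping average pooling; a trailing partial window is kept."""
    return [sum(seq[i:i + window]) / len(seq[i:i + window])
            for i in range(0, len(seq), window)]

def multi_scale_avg_pool(seq, windows=(1, 2, 4)):
    """Concatenate average-pooled views of `seq`, one per window size."""
    pooled = []
    for w in windows:
        pooled.extend(avg_pool_1d(seq, w))
    return pooled

features = [1.0, 3.0, 5.0, 7.0]
print(multi_scale_avg_pool(features))
# → [1.0, 3.0, 5.0, 7.0, 2.0, 6.0, 4.0]
# scale 1 keeps the sequence, scale 2 averages pairs, scale 4 averages all.
```

The same principle underlies MSSTFT, where the analysis is instead performed with several STFT window lengths to trade off time versus frequency resolution.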