A MULTI-SCALE FEATURE AGGREGATION BASED LIGHTWEIGHT NETWORK FOR AUDIO-VISUAL SPEECH ENHANCEMENT
Haitao Xu ( University of Science and Technology of China); Liangfa Wei (Tencent); Jie Zhang (University of Science and Technology of China); Jianming Yang (Tsinghua University); Yannan Wang (Tencent); Tian Gao (University of Science and Technology of China); Xin Fang (iFlytek Research); Lirong Dai (University of Science and Technology of China)
Audio-visual speech enhancement (AVSE) has been shown to outperform its conventional audio-only counterpart in improving speech quality. However, most existing AVSE models are heavyweight in terms of parameter count, which hinders deployment in practical applications. In this paper, we therefore present a lightweight AVSE approach (called M3Net) that incorporates multi-modality, multi-scale and multi-branch strategies. Three multi-scale techniques are designed for the visual and audio streams: multi-scale average pooling (MSAP), multi-scale ResNet (MSResNet) and multi-scale short-time Fourier transform (MSSTFT). Each multi-scale module is shown to contribute positively to the performance. We also introduce four skip connections for audio-visual feature aggregation, which strongly complement the designed multi-scale techniques. Experimental results show that these techniques can be flexibly combined with existing approaches and, more importantly, achieve performance comparable to heavyweight networks with a much smaller model size.
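To illustrate the multi-scale idea behind modules such as MSAP, the following is a minimal sketch (not the paper's actual implementation, whose architecture and window sizes are not specified here): a 1-D feature sequence is average-pooled at several window sizes, and the pooled views are concatenated so downstream layers see context at multiple temporal resolutions. The function names and the window sizes `(1, 2, 4)` are illustrative assumptions.

```python
# Illustrative multi-scale average pooling sketch (hypothetical names/scales,
# not the M3Net implementation).
def avg_pool_1d(seq, window):
    """Non-overlapping average pooling; a trailing partial window is kept."""
    return [sum(seq[i:i + window]) / len(seq[i:i + window])
            for i in range(0, len(seq), window)]

def multi_scale_avg_pool(seq, windows=(1, 2, 4)):
    """Concatenate average-pooled views of `seq`, one per window size."""
    pooled = []
    for w in windows:
        pooled.extend(avg_pool_1d(seq, w))
    return pooled

features = [1.0, 3.0, 5.0, 7.0]
print(multi_scale_avg_pool(features))
# → [1.0, 3.0, 5.0, 7.0, 2.0, 6.0, 4.0]
# scale 1 keeps the sequence, scale 2 averages pairs, scale 4 averages all.
```

The same principle underlies MSSTFT, where the analysis is instead performed with several STFT window lengths to trade off time versus frequency resolution.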