A LIGHTWEIGHT FOURIER CONVOLUTIONAL ATTENTION ENCODER FOR MULTI-CHANNEL SPEECH ENHANCEMENT
Siyu Sun (Wuhan University); Jian Jin (RTC Lab, ByteDance); Zhe Han (RTC Lab, ByteDance); Xianjun Xia (RTC Lab, ByteDance); Li Chen (ByteDance); Yijian Xiao (RTC Lab, ByteDance); Piao Ding (RTC Lab, ByteDance); Shenyi Song (RTC Engineering, ByteDance); Roberto Togneri (The University of Western Australia); Haijian Zhang (Wuhan University)
Predicting beamforming weights with deep neural networks has become one of the main approaches to multi-channel speech enhancement. Spectral-spatial cues are crucial for estimating these weights; however, many existing methods predict them suboptimally because they do not adequately learn spectral-spatial information. To tackle this challenge, we propose a Fourier convolutional attention encoder (FCAE) that provides a global receptive field over the frequency axis and boosts the learning of spectral contexts and cross-channel features. In addition, we propose a new convolutional recurrent encoder-decoder (CRED) structure that combines FCAEs, attention blocks with skip connections, and a deep feedforward sequential memory network (DFSMN) serving as the recurrent module. The proposed CRED structure captures joint spectral-spatial information to obtain an accurate estimate of the beamforming weights. Experimental results demonstrate the superiority of the proposed approach, which uses only 0.74M parameters and improves PESQ from 2.225 to 2.359 on the ConferencingSpeech2021 challenge development test set.
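The abstract does not spell out the layer definitions, but the key mechanism of a Fourier convolution over the frequency axis can be illustrated with a minimal sketch. The PyTorch snippet below is an assumption-based illustration, not the paper's implementation: the class name FourierConvFreq, the channel count, and the pointwise convolution over stacked real/imaginary parts are all hypothetical choices, and the attention path of the actual FCAE is omitted. It shows only why an FFT along frequency gives a global receptive field: after the transform, even a 1x1 convolution mixes information from every frequency bin at once.

```python
# Minimal sketch of a Fourier convolution along the frequency axis
# (hypothetical layer; the paper's FCAE additionally uses attention).
import torch
import torch.nn as nn


class FourierConvFreq(nn.Module):
    """Convolution with a global receptive field over frequency."""

    def __init__(self, channels: int):
        super().__init__()
        # Pointwise conv over the stacked real/imaginary parts; because it
        # operates on Fourier coefficients, each output bin depends on the
        # entire frequency range of the input.
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        n_freq = x.shape[-1]
        spec = torch.fft.rfft(x, dim=-1)              # (B, C, T, F//2+1), complex
        z = torch.cat([spec.real, spec.imag], dim=1)  # (B, 2C, T, F//2+1), real
        z = self.act(self.conv(z))
        real, imag = z.chunk(2, dim=1)
        spec = torch.complex(real, imag)
        # Inverse transform restores the original frequency resolution.
        return torch.fft.irfft(spec, n=n_freq, dim=-1)


if __name__ == "__main__":
    layer = FourierConvFreq(channels=4)
    feats = torch.randn(1, 4, 100, 257)  # e.g. features from a 4-mic STFT
    print(layer(feats).shape)            # torch.Size([1, 4, 100, 257])
```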