MULTI-SCALE TEMPORAL FREQUENCY CONVOLUTIONAL NETWORK WITH AXIAL ATTENTION FOR MULTI-CHANNEL SPEECH ENHANCEMENT
Guochang Zhang, Chunliang Wang, Libiao Yu, Jianqiang Wei
SPS
Length: 00:10:49
Speech quality is often degraded by background noise and reverberation. A dense prediction network is typically used to reconstruct the clean speech. In this work, we propose a novel backbone for dense prediction on speech. With part of its input and output adjusted, the backbone is applied to the multi-channel speech enhancement task in this paper. To improve its performance, we design several strategies: a multi-channel phase encoder, multi-scale temporal frequency processing, axial self-attention, and two-stage masking. The proposed method is evaluated on the datasets of the ICASSP 2022 L3DAS22 Challenge. Experimental results show that it outperforms previous state-of-the-art baselines by a large margin, and it ranked second in the L3DAS22 Challenge. The proposed backbone was also applied to mono-channel speech enhancement, where it ranked first in both the ICASSP 2022 AEC and DNS Challenges (non-personal track).