2D-To-2D Mask Estimation For Speech Enhancement Based On Fully Convolutional Neural Network
Yanhui Tu, Jun Du, Chin-Hui Lee
SPS
In recent years, deep learning-based approaches have become popular in the field of single-channel speech enhancement, and convolutional neural networks (CNNs) are a standard component of many current speech enhancement systems. In this study, we design a new fully convolutional neural network (FCNN)-based regression model, denoted as 2D-RFCNN, which directly maps the 2-dimensional (2D) noisy log-power spectra (LPS) input to a 2D time-frequency mask output. First, the entire 2D noisy LPS of one utterance is used directly as the network input, so that each convolutional filter can see more contextual information. Second, the pooling operation is applied only along the frequency axis, ensuring that the final frequency dimension is reduced to 1 while, simultaneously, the number of feature maps is made equal to the frequency dimension. Finally, we also adopt deep convolutional layers with small filters, as popularly used in speech recognition, for speech enhancement. Experiments on the CHiME-4 challenge task show that our proposed 2D-RFCNN model not only improves speech quality (PESQ) and intelligibility (STOI), but also reduces the recognition error rate on the real test set.
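The frequency-only pooling idea in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the input is treated as a (channels, frequency, time) tensor, pooling is applied only along the frequency axis so the time resolution is preserved, and repeated pooling collapses the frequency dimension to 1. All sizes and function names here are assumptions for illustration.

```python
import numpy as np

def pool_frequency_only(x, k=2):
    """Max-pool a (channels, freq, time) array along the freq axis only.

    Hypothetical helper: reshapes the frequency axis into groups of k
    and takes the max within each group, leaving the time axis intact.
    """
    c, f, t = x.shape
    f_out = f // k
    return x[:, :f_out * k, :].reshape(c, f_out, k, t).max(axis=2)

rng = np.random.default_rng(0)
n_freq, n_frames = 256, 100                      # illustrative sizes
x = rng.standard_normal((1, n_freq, n_frames))   # 1-channel noisy LPS "image"

# Repeated frequency-only pooling (the interleaved convolutional layers
# of the actual model are omitted) until the frequency dimension is 1.
while x.shape[1] > 1:
    x = pool_frequency_only(x, k=2)

print(x.shape)  # time axis preserved, frequency axis reduced to 1
```

In the model described above, each pooling stage would be paired with convolutions that grow the channel count, so that the final number of feature maps equals the original frequency dimension (here 256) and the (channels, 1, time) output can be read back as a (frequency, time) mask.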