Low-Latency Single Channel Speech Enhancement Using U-Net Convolutional Neural Networks
Ahmet E. Bulut, Kazuhito Koishida
SPS
Length: 15:24
Single-channel speech enhancement (SE) can be described, in its simplest terms, as learning a mapping from single-channel noisy speech to clean speech. We propose a simple but effective U-Net convolutional neural network (CNN) architecture with skip connections, aimed at real-time applications that require low-latency processing. To that end, we process relatively short temporal windows and apply time-frequency (T-F) featurization to them to perform magnitude estimation. Two state-of-the-art systems are chosen for benchmarking: one operating in the spectral domain [1] and the other in the time domain [2]. We evaluate the systems in terms of perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI). Experimental results show that, in terms of PESQ, the proposed method provides around 27% and 11% relative improvement over the two baseline systems, respectively, with significantly lower latency than either. We further investigate the trade-off between performance and overall latency of the proposed system.
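To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows how the analysis window length bounds the algorithmic latency, how short windows are turned into T-F magnitude features, and how a relative improvement figure is computed. The 16 kHz sample rate, 256-sample window, 128-sample hop, Hann window, and all function names are assumptions chosen for illustration; the abstract does not specify them. The U-Net itself is replaced by an identity placeholder.

```python
import numpy as np

def algorithmic_latency_ms(window_len, sample_rate):
    """Lower bound on latency: a full analysis window must be buffered."""
    return 1000.0 * window_len / sample_rate

def stft_magnitude_phase(signal, window_len, hop):
    """Frame the signal into short windows and return magnitude and phase."""
    window = np.hanning(window_len)  # assumed window type
    n_frames = 1 + (len(signal) - window_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + window_len] * window
        for i in range(n_frames)
    ])
    spec = np.fft.rfft(frames, axis=1)  # shape: (n_frames, window_len // 2 + 1)
    return np.abs(spec), np.angle(spec)

def enhance_magnitude(noisy_mag):
    """Hypothetical stand-in for the U-Net: returns the input unchanged."""
    return noisy_mag

def relative_improvement(proposed, baseline):
    """Relative improvement of a score over a baseline, in percent."""
    return 100.0 * (proposed - baseline) / baseline

# Assumed settings, for illustration only.
sample_rate = 16000
window_len, hop = 256, 128

rng = np.random.default_rng(0)
noisy = rng.standard_normal(sample_rate)  # one second of synthetic "noisy speech"

mag, phase = stft_magnitude_phase(noisy, window_len, hop)
enhanced_mag = enhance_magnitude(mag)

# Magnitude estimation leaves the phase untouched; the noisy phase is reused.
enhanced_spec = enhanced_mag * np.exp(1j * phase)
latency = algorithmic_latency_ms(window_len, sample_rate)
```

Under these assumed settings the window alone contributes 16 ms of buffering delay; shortening the window lowers that bound at the cost of coarser frequency resolution, which is the kind of trade-off between performance and latency that the abstract refers to.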