Progressive Multi-Target Network Based Speech Enhancement With SNR-Preselection For Robust Speaker Diarization
Jun Du, Lei Sun, Xueyang Zhang, Tian Gao, Xin Fang, Chin-Hui Lee
Length: 15:24
In this paper, we design a novel front-end processing system for speaker diarization under realistic conditions with challenging background noise. To cope with diverse environments, we first extend our previously proposed progressive learning based speech enhancement model by adding multi-task learning at each intermediate layer. The corresponding progressive multi-target (PMT) in the various layers includes both the progressive ratio mask (PRM) and the progressively enhanced log-power spectra (PELPS) at specified signal-to-noise ratios (SNRs). Front-end processing commonly introduces speech distortions, which often degrade back-end performance. The proposed speech enhancement model, however, can be regarded as a bagging of models with multiple learning objectives, which provides flexibility in selecting the most appropriate output for robust speaker diarization. In addition, a global SNR is estimated from the results of deep neural network (DNN) based speech activity detection (SAD) to decide whether the audio should be enhanced. We evaluate speaker diarization performance on the second DIHARD dataset, which includes several different realistic conditions. Experiments demonstrate that, compared with the original data, the data enhanced by our proposed method avoids performance loss in every individual domain and achieves consistent improvements in most domains.
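The SNR-preselection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the estimator or the decision rule, so the sketch assumes per-frame energies and binary SAD decisions as inputs, estimates noise power from non-speech frames, and enhances only when the estimated global SNR falls below a threshold. The function names, the noise-subtraction step, and the 10 dB threshold are all hypothetical.

```python
import math
from typing import Optional, Sequence


def estimate_global_snr_db(
    frame_energies: Sequence[float],
    speech_flags: Sequence[bool],
    floor: float = 1e-10,
) -> Optional[float]:
    """Estimate a recording-level SNR in dB from frame energies and SAD flags.

    Assumption (not from the paper): frames flagged as speech contain
    speech plus noise, and frames flagged as non-speech contain noise only.
    Returns None when either class of frames is missing.
    """
    speech = [e for e, s in zip(frame_energies, speech_flags) if s]
    noise = [e for e, s in zip(frame_energies, speech_flags) if not s]
    if not speech or not noise:
        return None
    noise_power = max(sum(noise) / len(noise), floor)
    # Subtract the noise estimate from the speech-frame power to
    # approximate the clean-speech power; floor it to keep the log defined.
    speech_power = max(sum(speech) / len(speech) - noise_power, floor)
    return 10.0 * math.log10(speech_power / noise_power)


def should_enhance(snr_db: Optional[float], threshold_db: float = 10.0) -> bool:
    """Hypothetical decision rule: enhance only when the SNR is below a threshold."""
    if snr_db is None:
        return False  # no reliable estimate, so leave the audio untouched
    return snr_db < threshold_db


# Example with made-up energies: two speech frames, two noise frames.
energies = [2.0, 2.0, 1.0, 1.0]
flags = [True, True, False, False]
snr = estimate_global_snr_db(energies, flags)
print(snr, should_enhance(snr))
```

Skipping enhancement for recordings that already have a high estimated SNR is one plausible way to avoid adding distortion to relatively clean audio, which is the concern the abstract raises about front-end processing.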