SEPDIFF: SPEECH SEPARATION BASED ON DENOISING DIFFUSION MODEL
Bo Chen (Huawei Technologies); Chao Wu (Huawei Technologies); Wenbin Zhao (Huawei Technologies)
SPS
Speech separation aims to extract multiple speech sources from a mixed signal. In this paper, we propose SepDiff, a monaural speech separation method based on the denoising diffusion model (diffusion model). By modifying the diffusion and reverse processes, we show that the diffusion model achieves impressive performance on speech separation. To generate the speech sources, we use the mel spectrogram of the mixture as a condition during training and insert it at every step of the sampling procedure. We propose a novel DNN structure that leverages local and global speech information through successive feature channel attention and dilated 2-D convolution blocks operating on multi-resolution time-frequency features. We use a neural vocoder to recover the waveform from the generated mel spectrogram. We evaluate SepDiff on the LibriMix dataset. Compared to the SepFormer approach, SepDiff yields a 0.11 higher mean opinion score (MOS).
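The abstract's key mechanism is conditional sampling: the mixture's mel spectrogram is re-inserted as a condition at every step of the reverse diffusion process. The sketch below is a minimal, hypothetical illustration of a standard DDPM-style reverse loop with such per-step conditioning; `denoiser` is a stand-in for the paper's DNN (the actual architecture, schedule, and conditioning interface are not specified in this abstract).

```python
import numpy as np

def reverse_diffusion(mix_mel, denoiser, T=50, beta_start=1e-4,
                      beta_end=0.02, seed=0):
    """Toy DDPM-style reverse process (illustrative, not the paper's exact
    method). The mixture mel spectrogram `mix_mel` is passed to the
    denoiser at every step; `denoiser(x_t, cond, t)` predicts the noise
    component of x_t."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(beta_start, beta_end, T)   # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from pure Gaussian noise with the shape of the target mel.
    x = rng.standard_normal(mix_mel.shape)
    for t in range(T - 1, -1, -1):
        eps_hat = denoiser(x, mix_mel, t)          # condition at every step
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise       # ancestral sampling step
    return x  # generated mel spectrogram of a separated source
```

With a trained denoiser, the returned mel spectrogram would then be passed to a neural vocoder to synthesize the separated waveform, as the abstract describes.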