SLOGD: Speaker Location Guided Deflation Approach to Speech Separation
Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr
SPS
Length: 11:53
Speech separation is the process of separating multiple speakers from an audio recording. In this work, we propose to separate the sources using a Speaker LOcalization Guided Deflation (SLOGD) approach, wherein we estimate the sources iteratively. In each iteration, we first estimate the location of a speaker and use it to estimate a mask corresponding to the localized speaker. The estimated source is then removed from the mixture before estimating the location and mask of the next source. Experiments are conducted on a reverberated, noisy multichannel version of the well-studied WSJ-2MIX dataset, using word error rate (WER) as the metric. The proposed method achieves a WER of 44.2%, a 34% relative improvement over the system without separation and a 17% relative improvement over Conv-TasNet.
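The iterative structure described in the abstract (localize, mask, remove, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `slogd_deflation`, its arguments, and the toy estimators `toy_location` and `toy_mask` are hypothetical names introduced here. The sketch also assumes a single non-negative magnitude spectrogram and removes each source by subtracting the masked residual, which is one simple way to read "removed from the mixture"; the abstract does not specify the actual estimators or the removal step.

```python
import numpy as np


def slogd_deflation(mixture, estimate_location, estimate_mask, num_sources):
    """Peel sources off a mixture one at a time (illustrative only).

    mixture: non-negative magnitude spectrogram, shape (freq, time).
    estimate_location, estimate_mask: user-supplied callables standing in
    for the localization and mask estimators (hypothetical interfaces).
    """
    residual = mixture.astype(float)  # astype returns a copy
    locations, sources = [], []
    for _ in range(num_sources):
        # Step 1: locate one speaker in what is left of the mixture.
        loc = estimate_location(residual)
        # Step 2: estimate a mask for the speaker at that location.
        mask = np.clip(estimate_mask(residual, loc), 0.0, 1.0)
        source = mask * residual
        # Step 3: remove the estimated source before the next iteration
        # (assumed here to be a simple subtraction).
        residual = residual - source
        locations.append(loc)
        sources.append(source)
    return locations, sources, residual


# Toy stand-ins, unrelated to real localization: the "location" is just the
# index of the most energetic frequency row, and the mask selects that row.
def toy_location(residual):
    return int(np.argmax(residual.sum(axis=1)))


def toy_mask(residual, loc):
    mask = np.zeros_like(residual)
    mask[loc] = 1.0
    return mask


mixture = np.zeros((4, 3))
mixture[1] = 2.0  # toy "speaker A" occupies row 1
mixture[3] = 1.0  # toy "speaker B" occupies row 3
locations, sources, residual = slogd_deflation(
    mixture, toy_location, toy_mask, num_sources=2
)
```

In this toy setup the louder row is extracted first and the quieter one second, which mirrors the deflation idea of handling one source per iteration. A real system would replace the two callables with trained estimators operating on multichannel input.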