Exploiting speaker embeddings for improved microphone clustering and speech separation in ad-hoc microphone arrays
Stijn Kindt (UGent); Jenthe Thienpondt (IDLab, Ghent University); Nilesh Madhu (IDLab, Ghent University - imec)
-
SPS
IEEE Members: $11.00
Non-members: $15.00
For separating sources captured by ad hoc distributed microphones a key first step is assigning the microphones to the appropriate source-dominated clusters. The features used for such (blind) clustering are based on a fixed length embedding of the audio signals in a high-dimensional latent space. In previous work, the embedding was hand-engineered from the Mel frequency cepstral coefficients and their modulation-spectra. This paper argues that embedding frameworks designed explicitly for the purpose of reliably discriminating between speakers would produce more appropriate features. We propose features generated by the state-of-the-art ECAPA-TDNN speaker verification model for the clustering. We benchmark these features in terms of the subsequent signal enhancement as well as on the quality of the clustering where, further, we introduce 2 intuitive metrics for the latter. Results indicate that in contrast to the hand-engineered features, the ECAPA-TDNN-based features lead to more logical clusters and better performance in the subsequent enhancement stages - thus validating our hypothesis.