QUANTITATIVE EVIDENCE ON OVERLOOKED ASPECTS OF ENROLLMENT SPEAKER EMBEDDINGS FOR TARGET SPEAKER SEPARATION

Xiaoyu Liu (Dolby Laboratories); Xu Li (Dolby Laboratories); Joan Serra (Dolby Laboratories)


06 Jun 2023

Single-channel target speaker separation (TSS) aims to extract a speaker's voice from a mixture of multiple talkers, given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on those embeddings. In this paper, we examine several important but overlooked aspects of the enrollment embeddings: the suitability of the widely used speaker identification embeddings, the introduction of log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms the latter in a cross-dataset evaluation. The competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across our study, which calls for future research on better upstream features.
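As a rough illustration of the upstream side of such a framework, the sketch below computes log-mel filterbank features from an enrollment waveform with NumPy and pools them into a fixed-size vector. The abstract does not say how the filterbank embedding is formed, so the mean pooling over time, the frame parameters (512-point FFT, 160-sample hop, 40 mel bands), and every function name here are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def hz_to_mel(f):
    # Common HTK-style mel scale formula.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_frames(wave, sample_rate=16000, n_fft=512, hop=160, n_mels=40):
    """Frame-level log-mel features, shape (n_frames, n_mels)."""
    wave = np.asarray(wave, dtype=float)
    if len(wave) < n_fft:
        raise ValueError("waveform is shorter than one analysis frame")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[t * hop : t * hop + n_fft] * window for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel + 1e-8)  # small floor avoids log(0)


def enrollment_embedding(wave, **kwargs):
    """Hypothetical utterance-level embedding: time average of log-mel frames."""
    return log_mel_frames(wave, **kwargs).mean(axis=0)


if __name__ == "__main__":
    # One second of noise stands in for a real enrollment utterance.
    enrollment = np.random.default_rng(0).standard_normal(16000)
    print(enrollment_embedding(enrollment).shape)
```

A downstream separator would then take a vector like this as its conditioning input. A trained upstream network, such as a speaker identification or self-supervised model, would replace `enrollment_embedding` in the other configurations the abstract compares.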

Tags: Speech enhancement and separation

Venue: IEEE ICASSP 2023, 4-10 June 2023, Greece (virtual and in-person conference)
