Self-supervised Speaker Recognition Training Using Human-Machine Dialogues

Metehan Cekic, Upamanyu Madhow, Ruirui Li, Zeya Chen, Yuguang Yang, Andreas Stolcke

Length: 00:11:16

08 May 2022

Speaker recognition, the task of recognizing speaker identities from voice alone, enables important downstream applications such as personalization and authentication. In a supervised setting, learning speaker representations depends heavily on labeled data that is both clean and sufficient, which is difficult to acquire. Noisy unlabeled data, on the other hand, also provides valuable information that can be exploited using self-supervised training methods. In this work, we investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart speaker devices. However, the supervisory information in such dialogues is inherently noisy, as multiple speakers may speak to a device in the course of the same dialogue. To address this issue, we propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity. We compare both reconstruction-based and contrastive-learning-based self-supervised methods. Experiments demonstrate that the proposed method provides significant performance improvements and outperforms earlier work. Dialogue pretraining, when combined with the rejection mechanism, yields a 27.10% reduction in equal error rate (EER) for speaker recognition compared to a model without self-supervised pretraining.
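As a rough illustration of what rejecting dialogues by acoustic homogeneity could look like, the sketch below scores each dialogue by the mean pairwise cosine similarity of its utterance embeddings and keeps only dialogues above a threshold. The abstract does not specify the homogeneity measure or the threshold, so the scoring rule, the function names (`cosine`, `homogeneity`, `select_dialogues`), the threshold of 0.5, and the toy 2-D embeddings are all assumptions made for illustration, not the authors' method.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def homogeneity(embeddings):
    """Mean pairwise cosine similarity of a dialogue's utterance embeddings.

    Assumed proxy for acoustic homogeneity; a dialogue with a single
    utterance is treated as trivially homogeneous.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 1.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def select_dialogues(dialogues, threshold=0.5):
    """Keep dialogues whose homogeneity score reaches the (hypothetical) threshold."""
    return [d for d in dialogues if homogeneity(d) >= threshold]


# Toy 2-D "embeddings": one dialogue that looks single-speaker, one that looks mixed.
same_speaker = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
mixed_speakers = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]

kept = select_dialogues([same_speaker, mixed_speakers], threshold=0.5)
print(len(kept))  # number of dialogues retained for pretraining
```

In this toy setup only the first dialogue passes the filter; in practice the embeddings would come from a speaker encoder and the threshold would need to be tuned on held-out data.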

Tags:

self-supervised training

speaker recognition

dialogue

rejection mechanism

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
