DOMAIN ROBUST DEEP EMBEDDING LEARNING FOR SPEAKER RECOGNITION
Hang-Rui Hu, Yan Song, Ying Liu, Li-Rong Dai, Ian McLoughlin, Lin Liu
SPS
This paper presents a domain robust deep embedding learning method for speaker verification (SV) tasks. Most recent methods utilize deep neural networks (DNNs) to learn compact and discriminative speaker embeddings from large-scale labeled datasets such as VoxCeleb and the NIST SRE corpus. Despite the success of existing methods, performance may degrade significantly on new target datasets, mainly due to the distribution discrepancy between the training and test domains. Moreover, corpora differ in how they are collected and in the languages they contain, so they span multiple, possibly mismatched, latent domains. To address this, a multi-task end-to-end framework is proposed to learn speaker embeddings from both a labeled source dataset and an unlabeled target dataset. Motivated by label smoothing, a smoothed knowledge distillation (SKD) based self-supervised learning method is designed to exploit latent structural information in the unlabeled target domain. Furthermore, a domain-aware batch normalization (DABN) module reduces the cross-domain distribution discrepancy, while a domain-agnostic instance normalization (DAIN) module learns features that are robust to within-domain variance. Evaluation on NIST SRE16 demonstrates significant performance gains.
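Two of the ideas named above can be sketched in a minimal NumPy form. The functions below are hypothetical illustrations under simple assumptions, not the authors' implementation: `smoothed_kd_targets` blends a teacher posterior with a uniform distribution in the spirit of label smoothing (the core of a smoothed distillation target), and `domain_aware_batchnorm` normalizes each feature using statistics computed separately per domain, the basic mechanism behind domain-aware batch normalization.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def smoothed_kd_targets(teacher_logits, alpha=0.1):
    # Label-smoothing-style target: mix the teacher posterior with a
    # uniform distribution over the k classes (illustrative formulation).
    p = softmax(teacher_logits)
    k = p.shape[-1]
    return (1.0 - alpha) * p + alpha / k

def kd_loss(student_logits, teacher_logits, alpha=0.1):
    # Cross-entropy between smoothed teacher targets and the student posterior.
    q = smoothed_kd_targets(teacher_logits, alpha)
    log_p = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_p = log_p - np.log(np.exp(log_p).sum(axis=-1, keepdims=True))
    return float(-(q * log_p).sum(axis=-1).mean())

def domain_aware_batchnorm(x, domain_ids, eps=1e-5):
    # Whiten each feature with per-domain batch statistics, so source and
    # target examples are normalized against their own domain's mean/variance.
    out = np.empty_like(x)
    for d in np.unique(domain_ids):
        idx = domain_ids == d
        mu = x[idx].mean(axis=0)
        var = x[idx].var(axis=0)
        out[idx] = (x[idx] - mu) / np.sqrt(var + eps)
    return out
```

In a framework like the one described, the smoothed targets would supervise the student on unlabeled target-domain data, while the per-domain normalization keeps source and target feature statistics aligned before the shared embedding layers.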