StarGAN-VC based Cross-Domain Data Augmentation for Speaker Verification

Hang-Rui Hu (University of Science and Technology of China); Yan Song (USTC); Jian-Tao Zhang (University of Science and Technology of China); Lirong Dai (University of Science and Technology of China); Ian V McLoughlin (The University of Science and Technology of China); Zhu Zhuo (Alibaba); Yu Zhou (Alibaba); Yuhong Li (Alibaba); Hui Xue (Alibaba)

07 Jun 2023

In real-world applications, automatic speaker verification (ASV) suffers from domain shift caused by mismatched intrinsic and extrinsic factors, such as recording device and speaking style, which leads to severe performance degradation. Because single-speaker multi-condition (SSMC) data is difficult to collect in practice, existing domain adaptation methods struggle to ensure that features of the same class remain consistent across different domains. To this end, we propose a cross-domain data generation method to obtain a domain-invariant ASV system. Inspired by the voice conversion (VC) task, a StarGAN-based generative model first learns cross-domain mappings from SSMC data and then generates the missing domain data for all speakers, thus increasing the intra-class diversity of the training set. Considering the differences between the ASV and VC tasks, we revise the corresponding training objectives and network structure to make the adaptation task-specific. Evaluations show a relative performance improvement of about 5-8% over the baseline in terms of minDCF and EER, outperforming the CNSRC winner's system of equivalent scale.
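The data-expansion step described in the abstract (generating the missing domain data for every speaker) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `augment_missing_domains`, the `convert` callable standing in for a trained StarGAN-style generator, the domain labels `phone` and `interview`, and the file names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (speaker_id, domain, utterance); the utterance is a placeholder string here.
Utterance = Tuple[str, str, str]


def augment_missing_domains(
    utterances: List[Utterance],
    convert: Callable[[str, str, str], str],
) -> List[Utterance]:
    """Return the original utterances plus converted copies for every
    (speaker, domain) pair that is absent from the training set."""
    all_domains = sorted({domain for _, domain, _ in utterances})

    # Group each speaker's utterances by the domain they were recorded in.
    by_speaker: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for speaker, domain, utt in utterances:
        by_speaker[speaker][domain].append(utt)

    augmented = list(utterances)
    for speaker, domains in by_speaker.items():
        missing = [d for d in all_domains if d not in domains]
        for src_domain, utts in domains.items():
            for utt in utts:
                for tgt_domain in missing:
                    # Speaker label is kept; only the domain changes.
                    new_utt = convert(utt, src_domain, tgt_domain)
                    augmented.append((speaker, tgt_domain, new_utt))
    return augmented


def dummy_convert(utt: str, src_domain: str, tgt_domain: str) -> str:
    # Hypothetical stand-in for a trained cross-domain generator.
    return f"{utt}->{tgt_domain}"


# Toy training set: only spk1 has data in both (hypothetical) domains.
train = [
    ("spk1", "phone", "a.wav"),
    ("spk1", "interview", "b.wav"),
    ("spk2", "phone", "c.wav"),
    ("spk3", "interview", "d.wav"),
]
augmented = augment_missing_domains(train, dummy_convert)

# Record which domains each speaker covers after augmentation.
coverage = defaultdict(set)
for speaker, domain, _ in augmented:
    coverage[speaker].add(domain)
```

In this toy example every speaker ends up with data in every domain while the speaker labels are unchanged, which is the sense in which the generated data increases intra-class diversity. In a real system `convert` would operate on acoustic features rather than file names.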

Tags:

Speaker recognition/identification/diarization

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
