StarGAN-VC based Cross-Domain Data Augmentation for Speaker Verification
Hang-Rui Hu (University of Science and Technology of China); Yan Song (University of Science and Technology of China); Jian-Tao Zhang (University of Science and Technology of China); Lirong Dai (University of Science and Technology of China); Ian v McLoughlin (University of Science and Technology of China); Zhu Zhuo (Alibaba); Yu Zhou (Alibaba); Yuhong Li (Alibaba); Hui Xue (Alibaba)
SPS
Automatic speaker verification (ASV) in real-world applications faces domain shift caused by mismatches in intrinsic and extrinsic factors, such as recording device and speaking style, which leads to severe performance degradation. Because single-speaker multi-condition (SSMC) data is difficult to collect in practice, existing domain adaptation methods struggle to keep features of the same class consistent across different domains. To this end, we propose a cross-domain data generation method to obtain a domain-invariant ASV system. Inspired by the voice conversion (VC) task, a StarGAN-based generative model first learns cross-domain mappings from SSMC data and then generates the missing domain data for all speakers, thus increasing the intra-class diversity of the training set. Considering the differences between the ASV and VC tasks, we revise the corresponding training objectives and network structure to make the adaptation task-specific. Evaluations show a relative performance improvement of about 5-8% over the baseline in terms of minDCF and EER, outperforming the CNSRC winner's system of equivalent scale.
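The data-generation step described in the abstract (filling in the domains each speaker is missing) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `augment_missing_domains`, the nested-dictionary data layout, and the `convert` callable standing in for a trained StarGAN-style generator are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical placeholder: an utterance is just a list of feature values here.
Utterance = List[float]
# Hypothetical stand-in for a trained generator: (utterance, source, target) -> utterance.
Converter = Callable[[Utterance, str, str], Utterance]


def augment_missing_domains(
    data: Dict[str, Dict[str, List[Utterance]]],
    domains: List[str],
    convert: Converter,
) -> Dict[str, Dict[str, List[Utterance]]]:
    """Return a copy of `data` in which every speaker has utterances in every domain.

    `data` maps speaker -> domain -> list of utterances. For each domain a
    speaker lacks, utterances are generated by converting that speaker's
    utterances from one domain the speaker does have.
    """
    augmented = {spk: dict(by_domain) for spk, by_domain in data.items()}
    for spk, by_domain in data.items():
        if not by_domain:
            continue  # no real data to convert from for this speaker
        # Assumption: use the alphabetically first available domain as the source.
        source = sorted(by_domain)[0]
        for target in domains:
            if target in by_domain:
                continue  # real data exists, nothing to generate
            augmented[spk][target] = [
                convert(utt, source, target) for utt in by_domain[source]
            ]
    return augmented
```

In this sketch the original dictionary is left unmodified, so real and generated utterances can be kept apart if the downstream ASV training needs to weight them differently.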