Pretraining Conformer with ASR for Speaker Verification
Danwei Cai (Duke University); Weiqing Wang (Duke University); Ming Li (Duke Kunshan University); Rui Xia (ByteDance AI Lab); Chuanzeng Huang (Speech, Audio and Music Intelligence (SAMI) group, ByteDance)
This paper proposes pretraining Conformer on an automatic speech recognition (ASR) task for speaker verification. Conformer combines a convolutional neural network (CNN) and a Transformer to model local and global features, respectively. Recently, the multi-scale feature aggregation Conformer (MFA-Conformer) has been proposed for automatic speaker verification; it concatenates the frame-level outputs from all Conformer blocks before pooling. However, our experiments show that Conformer easily overfits with limited speaker recognition training data. To avoid overfitting, we transfer the knowledge learned from ASR to speaker verification: an ASR-pretrained Conformer is used to initialize the training of MFA-Conformer for speaker verification. Our experiments show that pretraining Conformer with ASR leads to significant performance gains across model sizes. The best model achieves 0.48%, 0.71% and 1.54% EER on VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H, respectively.
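The multi-scale aggregation the abstract describes — concatenating the frame-level outputs of all Conformer blocks along the feature axis before pooling — can be sketched as follows. This is an illustrative numpy sketch under assumed shapes (frames × feature dim per block), paired here with simple statistics pooling (mean + standard deviation over time); it is not the authors' implementation, and the function name `mfa_pool` is hypothetical.

```python
import numpy as np

def mfa_pool(block_outputs):
    """Multi-scale feature aggregation sketch (assumption, not the paper's code):
    concatenate each Conformer block's frame-level output along the feature
    axis, then apply statistics pooling (mean + std over the time axis)."""
    # (T, num_blocks * d): stack all blocks' features per frame
    h = np.concatenate(block_outputs, axis=-1)
    mean = h.mean(axis=0)
    std = h.std(axis=0)
    # (2 * num_blocks * d,): utterance-level embedding before the final projection
    return np.concatenate([mean, std])

# toy example: 6 Conformer blocks, 200 frames, 256-dim frame features
rng = np.random.default_rng(0)
outputs = [rng.standard_normal((200, 256)) for _ in range(6)]
emb = mfa_pool(outputs)
print(emb.shape)  # (3072,)
```

In the ASR-pretraining scheme the paper proposes, the Conformer blocks producing `block_outputs` would be initialized from an ASR checkpoint before MFA-Conformer training, while the pooling and speaker classification layers are trained from scratch.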