Pretraining Conformer with ASR for Speaker Verification

Danwei Cai (Duke University); Weiqing Wang (Duke University); Ming Li (Duke Kunshan University); Rui Xia (ByteDance AI Lab); Chuanzeng Huang (Speech, Audio and Music Intelligence (SAMI) group, ByteDance)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
07 Jun 2023

This paper proposes pretraining Conformer on an automatic speech recognition (ASR) task for speaker verification. Conformer combines a convolutional neural network (CNN) with a Transformer to model local and global features, respectively. Recently, the multi-scale feature aggregation Conformer (MFA-Conformer) was proposed for automatic speaker verification. MFA-Conformer concatenates the frame-level outputs of all Conformer blocks before pooling. However, our experiments show that Conformer overfits easily when speaker recognition training data is limited. To avoid overfitting, we propose transferring the knowledge learned from ASR to speaker verification. Specifically, an ASR-pretrained Conformer is used to initialize the MFA-Conformer before it is trained for speaker verification. Our experiments show that pretraining Conformer with ASR leads to significant performance gains across model sizes. The best model achieves equal error rates (EER) of 0.48%, 0.71%, and 1.54% on VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H, respectively.
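
The two mechanisms named in the abstract, concatenating frame-level outputs from all Conformer blocks and initializing the speaker model from an ASR-pretrained encoder, can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation: the function names, the `encoder.` parameter prefix, the nested-list data layout, and the choice of mean and standard deviation pooling are all assumptions made for illustration, since the abstract only says the concatenated features go to "further pooling".

```python
import math


def concat_block_outputs(block_outputs):
    """Concatenate the frame-level outputs of all blocks along the feature axis.

    block_outputs: list of L blocks, each a list of T frames with D features.
    Returns T frames with L * D features each.
    """
    num_frames = len(block_outputs[0])
    return [
        [x for block in block_outputs for x in block[t]]
        for t in range(num_frames)
    ]


def mean_std_pool(frames):
    """Pool over time into one utterance-level vector.

    Assumption: mean + standard deviation pooling is used here as a stand-in;
    the abstract does not specify the pooling layer.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


def init_from_asr(speaker_params, asr_params, prefix="encoder."):
    """Copy encoder parameters from an ASR checkpoint into the speaker model.

    Both arguments are plain dicts mapping hypothetical parameter names to
    lists of floats. Only names that start with `prefix` and exist in both
    dicts are copied; ASR-only parts (e.g. a decoder) are ignored, and
    speaker-only parts (e.g. pooling) keep their existing initialization.
    Returns the list of copied parameter names.
    """
    copied = []
    for name, value in asr_params.items():
        if name.startswith(prefix) and name in speaker_params:
            speaker_params[name] = list(value)
            copied.append(name)
    return copied


# Toy usage: 3 blocks, 4 frames, 2 features per frame.
blocks = [[[float(b + t), float(b - t)] for t in range(4)] for b in range(3)]
frame_features = concat_block_outputs(blocks)   # 4 frames x 6 features
embedding_input = mean_std_pool(frame_features)  # 12 values (6 means, 6 stds)

speaker = {"encoder.block0.w": [0.0, 0.0], "pooling.w": [0.5]}
asr = {"encoder.block0.w": [1.0, 2.0], "decoder.w": [3.0]}
copied_names = init_from_asr(speaker, asr)
```

In a real system these steps would typically operate on tensors in a deep learning framework, with the speaker model then trained on speaker-labeled data starting from the copied encoder weights.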
