Pretraining Conformer with ASR for Speaker Verification
Danwei Cai (Duke University); Weiqing Wang (Duke University); Ming Li (Duke Kunshan University); Rui Xia (ByteDance AI Lab); Chuanzeng Huang (Speech, Audio and Music Intelligence (SAMI) group, ByteDance)
This paper proposes pretraining Conformer on an automatic speech recognition (ASR) task for speaker verification. Conformer combines a convolutional neural network (CNN) and a Transformer to model local and global features, respectively. Recently, the multi-scale feature aggregation Conformer (MFA-Conformer) has been proposed for automatic speaker verification; it concatenates the frame-level outputs from all Conformer blocks before pooling. However, our experiments show that Conformer easily overfits with limited speaker recognition training data. To avoid overfitting, we transfer the knowledge learned from ASR to speaker verification: an ASR-pretrained Conformer is used to initialize the training of MFA-Conformer for speaker verification. Our experiments show that pretraining Conformer with ASR leads to significant performance gains across model sizes. The best model achieves 0.48%, 0.71% and 1.54% EER on VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H, respectively.
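The multi-scale aggregation the abstract describes — concatenating the frame-level outputs of all Conformer blocks along the feature axis before pooling — can be sketched as follows. This is an illustrative numpy sketch under assumed shapes (frames × feature dim per block), paired here with simple statistics pooling (mean + standard deviation over time); it is not the authors' implementation, and the function name `mfa_pool` is hypothetical.

```python
import numpy as np

def mfa_pool(block_outputs):
    """Multi-scale feature aggregation sketch (assumption, not the paper's code):
    concatenate each Conformer block's frame-level output along the feature
    axis, then apply statistics pooling (mean + std over the time axis)."""
    # (T, num_blocks * d): stack all blocks' features per frame
    h = np.concatenate(block_outputs, axis=-1)
    mean = h.mean(axis=0)
    std = h.std(axis=0)
    # (2 * num_blocks * d,): utterance-level embedding before the final projection
    return np.concatenate([mean, std])

# toy example: 6 Conformer blocks, 200 frames, 256-dim frame features
rng = np.random.default_rng(0)
outputs = [rng.standard_normal((200, 256)) for _ in range(6)]
emb = mfa_pool(outputs)
print(emb.shape)  # (3072,)
```

In the ASR-pretraining scheme the paper proposes, the Conformer blocks producing `block_outputs` would be initialized from an ASR checkpoint before MFA-Conformer training, while the pooling and speaker classification layers are trained from scratch.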