Modular Conformer Training for Flexible End-to-End ASR
Kartik Audhkhasi (Google); Brian Farris (Google); Bhuvana Ramabhadran (Google); Pedro J Moreno (Google)
The state-of-the-art Conformer model used in automatic speech recognition combines feed-forward, convolution, and multi-headed self-attention layers in a single network that is trained end-to-end with a decoder.
While this end-to-end training is simple and beneficial for word error rate (WER), it restricts the ability to run inference at different operating points in the WER-latency trade-off.
Existing approaches to overcome this limitation include cascaded encoders and variable attention context models.
We propose an alternative approach, called Modular Conformer training, which splits the Conformer model into a convolutional backbone model and attention submodels added at each layer.
We conduct experiments with a few training techniques on the LibriSpeech and Libri-Light corpora.
We show that dropping out the attention layers during backbone-model training yields the largest WER improvements once fine-tuned attention submodels are added, without degrading the WER of the backbone model itself.
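As a rough illustration of the idea described above, the sketch below builds a Conformer-style block whose convolutional backbone always runs, while the self-attention submodel can be stochastically skipped during backbone training and omitted entirely at inference for a lower-latency operating point. This is a minimal PyTorch sketch under assumed details: the module layout, dimensions, the attn_drop_prob parameter, and the use_attention flag are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ModularConformerBlock(nn.Module):
    """Conformer-style layer split into a convolutional backbone plus an
    optional attention submodel (a sketch of the modular-training idea)."""

    def __init__(self, d_model: int = 256, num_heads: int = 4,
                 conv_kernel: int = 15, attn_drop_prob: float = 0.5):
        super().__init__()
        # Backbone: feed-forward and depthwise-convolution modules.
        self.ffn = nn.Sequential(
            nn.LayerNorm(d_model), nn.Linear(d_model, 4 * d_model),
            nn.SiLU(), nn.Linear(4 * d_model, d_model))
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, conv_kernel,
                              padding=conv_kernel // 2, groups=d_model)
        # Attention submodel: added on top of the backbone, fine-tuned later.
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Probability of skipping the attention layer during backbone training.
        self.attn_drop_prob = attn_drop_prob

    def forward(self, x: torch.Tensor, use_attention: bool = True) -> torch.Tensor:
        # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn(x)
        # Stochastically drop the whole attention layer while training the
        # backbone, so the backbone stays usable without attention submodels.
        drop = self.training and torch.rand(1).item() < self.attn_drop_prob
        if use_attention and not drop:
            h = self.attn_norm(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.conv_norm(x).transpose(1, 2)  # (batch, d_model, time)
        x = x + torch.nn.functional.silu(self.conv(h)).transpose(1, 2)
        return x


# Backbone-only inference (lower latency) vs. full model (lower WER):
block = ModularConformerBlock().eval()
feats = torch.randn(2, 100, 256)
backbone_out = block(feats, use_attention=False)
full_out = block(feats, use_attention=True)
```

In this sketch, skipping the attention layer with some probability during backbone training plays the role of the attention dropout discussed in the abstract; the actual drop schedule, layer structure, and decoder integration used in the paper are not specified here.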