Modular Conformer Training for Flexible End-to-End ASR
Kartik Audhkhasi (Google); Brian Farris (Google); Bhuvana Ramabhadran (Google); Pedro J Moreno (Google)
The state-of-the-art Conformer model used in automatic speech recognition combines feed-forward, convolution, and multi-headed self-attention layers in a single network that is trained end-to-end with a decoder.
While this end-to-end training is simple and beneficial for word error rate (WER), it restricts the ability to run inference at different operating points in the WER-latency trade-off.
Existing approaches to overcome this limitation include cascaded encoders and variable attention context models.
We propose an alternative approach, called Modular Conformer training, which splits the Conformer model into a convolutional backbone model and attention submodels added at each layer.
We conduct experiments with a few training techniques on the LibriSpeech and Libri-Light corpora.
We show that dropping out the attention layers during backbone-model training yields the largest WER improvements once fine-tuned attention submodels are added, without degrading the WER of the backbone model itself.
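As a rough illustration of the idea described above, the sketch below builds a Conformer-style block whose convolutional backbone always runs, while the self-attention submodel can be stochastically skipped during backbone training and omitted entirely at inference for a lower-latency operating point. This is a minimal PyTorch sketch under assumed details: the module layout, dimensions, the attn_drop_prob parameter, and the use_attention flag are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ModularConformerBlock(nn.Module):
    """Conformer-style layer split into a convolutional backbone plus an
    optional attention submodel (a sketch of the modular-training idea)."""

    def __init__(self, d_model: int = 256, num_heads: int = 4,
                 conv_kernel: int = 15, attn_drop_prob: float = 0.5):
        super().__init__()
        # Backbone: feed-forward and depthwise-convolution modules.
        self.ffn = nn.Sequential(
            nn.LayerNorm(d_model), nn.Linear(d_model, 4 * d_model),
            nn.SiLU(), nn.Linear(4 * d_model, d_model))
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, conv_kernel,
                              padding=conv_kernel // 2, groups=d_model)
        # Attention submodel: added on top of the backbone, fine-tuned later.
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Probability of skipping the attention layer during backbone training.
        self.attn_drop_prob = attn_drop_prob

    def forward(self, x: torch.Tensor, use_attention: bool = True) -> torch.Tensor:
        # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn(x)
        # Stochastically drop the whole attention layer while training the
        # backbone, so the backbone stays usable without attention submodels.
        drop = self.training and torch.rand(1).item() < self.attn_drop_prob
        if use_attention and not drop:
            h = self.attn_norm(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.conv_norm(x).transpose(1, 2)  # (batch, d_model, time)
        x = x + torch.nn.functional.silu(self.conv(h)).transpose(1, 2)
        return x


# Backbone-only inference (lower latency) vs. full model (lower WER):
block = ModularConformerBlock().eval()
feats = torch.randn(2, 100, 256)
backbone_out = block(feats, use_attention=False)
full_out = block(feats, use_attention=True)
```

In this sketch, skipping the attention layer with some probability during backbone training plays the role of the attention dropout discussed in the abstract; the actual drop schedule, layer structure, and decoder integration used in the paper are not specified here.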