IMPROVING NON-AUTOREGRESSIVE SPEECH RECOGNITION WITH AUTOREGRESSIVE PRETRAINING
Yanjia Li (Fano Labs); Lahiru T Samarakoon (Fano Labs, Hong Kong); Ivan Fung (Fano Labs)
Autoregressive (AR) automatic speech recognition (ASR) models predict each output token conditioned on the previous ones, which slows down inference. Non-autoregressive (NAR) models, on the other hand, predict tokens independently and simultaneously within a constant number of decoding iterations, which yields high inference speed. However, NAR models generally have lower accuracy than AR models. In this work, we propose AR pretraining of the NAR encoder to reduce the accuracy gap between AR and NAR models. The experimental results show that our AR-pretrained MaskCTC matches the accuracy of the AR Conformer on Aishell-1 (both 4.9% CER) and reduces the performance gap with the AR Conformer on LibriSpeech by 50% relative. Moreover, our AR-pretrained MaskCTC needs only a single decoding iteration, which reduces inference time by 50%. We also investigate multiple masking strategies for training the masked language model of MaskCTC.
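To make the masked-language-model objective mentioned above concrete, the following is a minimal PyTorch sketch of the uniform random masking commonly used in Mask-Predict-style training, where the number of masked positions per sequence is drawn uniformly at random. The function name `random_mask` and its token-ID arguments are illustrative assumptions, not identifiers from the paper or from ESPnet.

```python
import torch


def random_mask(targets: torch.Tensor, mask_token: int, pad_token: int) -> torch.Tensor:
    """Uniform random masking for conditional masked LM training (a sketch).

    For each sequence in the batch, sample the number of positions to mask
    uniformly from 1..length, then replace that many randomly chosen
    non-padding tokens with the mask token. The model is trained to
    recover the original tokens at the masked positions.
    """
    masked = targets.clone()
    for b in range(targets.size(0)):
        # Indices of real (non-padding) tokens in this sequence.
        positions = (targets[b] != pad_token).nonzero(as_tuple=True)[0]
        length = positions.numel()
        if length == 0:
            continue
        # Sample how many tokens to mask, uniformly over 1..length.
        num_to_mask = int(torch.randint(1, length + 1, (1,)).item())
        # Choose that many positions at random and mask them.
        chosen = positions[torch.randperm(length)[:num_to_mask]]
        masked[b, chosen] = mask_token
    return masked


if __name__ == "__main__":
    # Toy batch of token IDs; 0 is padding, 1 is the <mask> token (assumed IDs).
    targets = torch.tensor([[5, 8, 3, 9, 0, 0], [7, 2, 6, 4, 3, 8]])
    print(random_mask(targets, mask_token=1, pad_token=0))
```

Alternative strategies, such as masking a fixed fraction of tokens or masking contiguous spans, can be implemented by changing how `num_to_mask` and `chosen` are drawn; the comparison of such strategies is what the abstract refers to.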