IMPROVING NON-AUTOREGRESSIVE SPEECH RECOGNITION WITH AUTOREGRESSIVE PRETRAINING

Yanjia Li (Fano Labs); Lahiru T Samarakoon (Fano Labs, Hong Kong); Ivan Fung (Fano Labs)

07 Jun 2023

Autoregressive (AR) automatic speech recognition (ASR) models predict each output token conditioned on the previous ones, which slows down their inference. Non-autoregressive (NAR) models, on the other hand, predict tokens independently and simultaneously within a constant number of decoding iterations, which yields high inference speed; however, NAR models generally have lower accuracy than AR models. In this work, we propose AR pretraining of the NAR encoder to reduce the accuracy gap between AR and NAR models. Experimental results show that our AR-pretrained MaskCTC reaches the same accuracy as the AR Conformer on Aishell-1 (both 4.9% CER) and reduces the performance gap with the AR Conformer on LibriSpeech by 50% relative. Moreover, our AR-pretrained MaskCTC needs only a single decoding iteration, which reduces inference time by 50%. We also investigate multiple masking strategies for training the masked language model of MaskCTC.
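The core idea described above is to pretrain the speech encoder inside an AR (attention-decoder) model and then reuse those encoder weights to initialise the NAR MaskCTC model. The sketch below is a minimal, simplified PyTorch illustration of that weight-transfer setup, not the authors' ESPnet/Conformer recipe; all class names, dimensions, and hyperparameters are illustrative assumptions, and losses, training loops, and iterative mask-predict decoding are omitted.

```python
import torch
import torch.nn as nn

VOCAB, FEAT, DIM = 100, 80, 256  # illustrative vocabulary size, feature dim, model dim


class Encoder(nn.Module):
    """Shared speech encoder (a plain Transformer encoder stands in for Conformer)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, feats):                     # feats: (B, T, FEAT)
        return self.enc(self.proj(feats))         # (B, T, DIM)


class ARModel(nn.Module):
    """AR pretraining model: decoder attends to the encoder and predicts the next token."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerDecoderLayer(DIM, nhead=4, batch_first=True)
        self.dec = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, feats, prev_tokens):        # prev_tokens: (B, L)
        memory = self.encoder(feats)
        tgt = self.embed(prev_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.out(self.dec(tgt, memory, tgt_mask=causal))


class NARModel(nn.Module):
    """MaskCTC-style NAR model: CTC head on the encoder plus a conditional masked LM
    decoder that fills in masked token positions in parallel (no causal mask)."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder                    # to be initialised from the AR-pretrained encoder
        self.ctc = nn.Linear(DIM, VOCAB)
        self.embed = nn.Embedding(VOCAB + 1, DIM) # +1 for a <mask> token
        layer = nn.TransformerDecoderLayer(DIM, nhead=4, batch_first=True)
        self.mlm = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, feats, masked_tokens):      # masked_tokens: (B, L), masks already applied
        memory = self.encoder(feats)
        h = self.mlm(self.embed(masked_tokens), memory)   # all positions predicted at once
        return self.ctc(memory), self.out(h)


# Stage 1: AR pretraining (cross-entropy on next-token targets, not shown).
ar = ARModel(Encoder())

# Stage 2: build the NAR model and initialise its encoder from the AR-pretrained one.
nar = NARModel(Encoder())
nar.encoder.load_state_dict(ar.encoder.state_dict())
```

In this sketch the encoder architecture is identical in both models, so the AR-pretrained weights can be copied directly; the NAR model is then trained with its CTC and masked-LM objectives starting from that initialisation.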
