CUMULATIVE ATTENTION BASED STREAMING TRANSFORMER ASR WITH INTERNAL LANGUAGE MODEL JOINT TRAINING AND RESCORING
Mohan Li (Toshiba Europe Ltd.); Cong-Thanh Do (Toshiba Research Europe Ltd.); Rama S. Doddipatla (Toshiba Europe Ltd.)
This paper presents an approach to improve the performance of streaming Transformer ASR by introducing an internal language model (ILM) as part of the decoder layers. In the recently proposed cumulative attention (CA) based streaming ASR system, only the last or top few decoder layers are equipped with the CA module. In this work, we therefore propose to train the bottom (non-CA) layers as an ILM, using an auxiliary LM loss jointly with the rest of the system. During inference, the outputs of the ILM are interpolated with those of the entire Transformer decoder, as is done in conventional external language model (ELM) rescoring. The paper also proposes a refinement to the CA algorithm, referred to as CTC look-ahead, to improve the precision of endpoint detection. Experiments conducted on the AIShell-1, Aidatatang and LibriSpeech datasets show that the proposed ILM rescoring method achieves on-par or better ASR performance compared to the ELM rescoring baseline. In addition, the CTC look-ahead strategy effectively alleviates the early end-of-speech (EOS) triggering issue suffered by the CA module, without introducing noticeable latency degradation.
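To make the joint-training and rescoring idea concrete, the sketch below illustrates (under stated assumptions) how an auxiliary LM loss on the bottom (non-CA) decoder layers could be combined with the main attention loss, and how the ILM score might be interpolated with the full-decoder score at rescoring time. The function and argument names (joint_loss, rescore, ilm_loss_weight, ilm_weight) and the weight values are illustrative placeholders, not taken from the paper.

```python
# Minimal sketch, assuming a Transformer decoder whose bottom (non-CA)
# layers can also be run stand-alone as an internal LM producing their
# own output logits. Hypothetical names and weights, for illustration only.
import torch
import torch.nn.functional as F


def joint_loss(decoder_logits, ilm_logits, targets, pad_id, ilm_loss_weight=0.3):
    """Main attention (cross-entropy) loss plus an auxiliary LM loss on the ILM.

    decoder_logits: (B, T, V) outputs of the full decoder (including CA layers)
    ilm_logits:     (B, T, V) outputs of the bottom (non-CA) layers used as the ILM
    targets:        (B, T) token ids, padded with pad_id
    """
    att_ce = F.cross_entropy(decoder_logits.transpose(1, 2), targets, ignore_index=pad_id)
    ilm_ce = F.cross_entropy(ilm_logits.transpose(1, 2), targets, ignore_index=pad_id)
    return att_ce + ilm_loss_weight * ilm_ce


def rescore(decoder_logprob, ilm_logprob, ilm_weight=0.2):
    """Interpolate full-decoder and ILM hypothesis scores, analogous to
    conventional external LM rescoring of an n-best list."""
    return decoder_logprob + ilm_weight * ilm_logprob
```

In such a setup, rescore would be applied to each hypothesis in the n-best list produced by the streaming decoder, and the hypothesis with the highest interpolated score would be selected; the interpolation weight would typically be tuned on a development set.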