
CUMULATIVE ATTENTION BASED STREAMING TRANSFORMER ASR WITH INTERNAL LANGUAGE MODEL JOINT TRAINING AND RESCORING

Mohan Li (Toshiba Europe Ltd); Cong-Thanh Do (Toshiba Research Europe Ltd.); Rama S. Doddipatla (Toshiba Europe Ltd)

06 Jun 2023

This paper presents an approach to improve the performance of streaming Transformer ASR by introducing an internal language model (ILM) as part of the decoder layers. In the recently proposed cumulative attention (CA) based streaming ASR system, only the last decoder layer, or the top few layers, are equipped with the CA module. Thus, in this work, we propose to train the bottom (non-CA) layers as an ILM, jointly with the rest of the system, using an auxiliary LM loss. During inference, the outputs of the ILM are interpolated with those of the entire Transformer decoder, as in conventional external language model (ELM) rescoring. The paper also proposes a refinement to the CA algorithm, termed CTC look-ahead, to improve the precision of endpoint detection. Experiments conducted on the AIShell-1, Aidatatang and Librispeech datasets show that the proposed ILM rescoring method achieves on-par or better ASR performance compared to the ELM rescoring baseline. In addition, the CTC look-ahead strategy effectively alleviates the early end-of-speech (EOS) triggering issue suffered by the CA module, without introducing noticeable latency degradation.
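To make the joint-training and rescoring idea concrete, the following is a minimal PyTorch sketch of the two generic ingredients the abstract describes: an auxiliary LM cross-entropy loss on the bottom (non-CA) decoder layers during training, and log-probability interpolation between the full decoder and the ILM at inference time. The function names, tensor shapes and the weights lambda_ilm, lambda_ctc and beta are illustrative assumptions, not values or APIs from the paper.

    import torch
    import torch.nn.functional as F

    def joint_training_loss(decoder_logits, ilm_logits, targets, ctc_loss,
                            lambda_ilm=0.3, lambda_ctc=0.3, pad_id=-100):
        # Sketch of joint training: ASR cross-entropy on the full decoder output
        # plus an auxiliary LM cross-entropy on the bottom (non-CA) layers,
        # alongside the usual CTC term. Logits are (batch, length, vocab).
        asr_loss = F.cross_entropy(decoder_logits.transpose(1, 2), targets,
                                   ignore_index=pad_id)
        ilm_loss = F.cross_entropy(ilm_logits.transpose(1, 2), targets,
                                   ignore_index=pad_id)
        return asr_loss + lambda_ilm * ilm_loss + lambda_ctc * ctc_loss

    def rescored_log_probs(decoder_logits, ilm_logits, beta=0.2):
        # Sketch of inference-time rescoring: interpolate the full decoder's
        # log-probabilities with the ILM's, analogous to ELM rescoring.
        return (F.log_softmax(decoder_logits, dim=-1)
                + beta * F.log_softmax(ilm_logits, dim=-1))

In a beam-search decoder, rescored_log_probs would replace the plain decoder log-probabilities when scoring hypothesis extensions; the interpolation weight plays the same role as the LM weight in ELM rescoring.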
