
Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig

Length: 0:13:53
19 Jan 2021

In this work, we perform comprehensive evaluations of automatic speech recognition (ASR) accuracy and efficiency with three popular training criteria for latency-controlled streaming ASR applications: LF-MMI, CTC and RNN-T. In recognizing challenging social media videos in 7 languages, with training data ranging from 3K to 14K hours, we conduct large-scale controlled experiments across the training criteria with identical datasets and encoder model architecture, and find that RNN-T models have a consistent advantage in word error rate (WER), while CTC models have a consistent advantage in inference efficiency as measured by real-time factor (RTF). Additionally, for each training criterion we selectively examine various modeling strategies, including modeling units, encoder architectures, and pre-training. To the best of our knowledge, this is the first comprehensive benchmark of these three widely used ASR training criteria on real-world streaming ASR applications across multiple languages.
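As a rough illustration of two of the three criteria being compared (this is a minimal sketch, not the authors' implementation), the snippet below computes CTC and RNN-T losses with standard PyTorch/torchaudio primitives, and shows the RTF arithmetic used as the efficiency metric. The tensor shapes, vocabulary size, and blank index are illustrative assumptions; LF-MMI is omitted because it additionally requires a denominator graph, typically built with toolkits such as k2 or pychain.

```python
# Minimal sketch of the CTC and RNN-T training criteria in PyTorch/torchaudio.
# All shapes and the vocabulary size below are illustrative assumptions.
import torch
import torchaudio.functional as F

B, T, U, V = 4, 100, 20, 32  # batch, encoder frames, target length, vocab size (incl. blank)

targets = torch.randint(1, V, (B, U), dtype=torch.long)   # labels in [1, V); 0 is blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

# CTC: per-frame log-probabilities over the vocabulary, conditionally
# independent across frames given the encoder output.
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)      # (T, B, V)
ctc_loss = torch.nn.CTCLoss(blank=0)(
    log_probs, targets, input_lengths, target_lengths
)

# RNN-T: a joint network scores every (frame, target-position) pair, so the
# logits carry an extra target-length axis; torchaudio applies the
# log-softmax internally.
joint_logits = torch.randn(B, T, U + 1, V)                # (B, T, U+1, V)
rnnt_loss = F.rnnt_loss(
    joint_logits,
    targets.to(torch.int32),
    input_lengths.to(torch.int32),
    target_lengths.to(torch.int32),
    blank=0,
)

# Real-time factor (RTF): decoding wall-clock time divided by audio duration;
# RTF < 1 means the system decodes faster than real time.
rtf = 2.5 / 10.0   # e.g., 2.5 s to decode 10 s of audio -> RTF = 0.25
```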
