
Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

Xiaohui Zhang, Frank Zhang, Chunxi Liu, Kjell Schubert, Julian Chan, Pradyot Prakash, Jun Liu, Ching-feng Yeh, Fuchun Peng, Yatharth Saraf, Geoffrey Zweig

Length: 0:13:53
19 Jan 2021

In this work, we perform comprehensive evaluations of automatic speech recognition (ASR) accuracy and efficiency with three popular training criteria for latency-controlled streaming ASR applications: LF-MMI, CTC and RNN-T. In recognizing challenging social media videos in 7 languages, with training data ranging from 3K to 14K hours, we conduct large-scale controlled experiments across the training criteria with identical datasets and encoder model architecture, and find that RNN-T models have a consistent advantage in word error rate (WER), while CTC models have a consistent advantage in inference efficiency as measured by real-time factor (RTF). Additionally, for each training criterion we selectively examine various modeling strategies, including modeling units, encoder architectures, and pre-training. To the best of our knowledge, this is the first comprehensive benchmark of these three widely used ASR training criteria on real-world streaming ASR applications across multiple languages.
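As a rough illustration of two of the three criteria being compared (this is a minimal sketch, not the authors' implementation), the snippet below computes CTC and RNN-T losses with standard PyTorch/torchaudio primitives, and shows the RTF arithmetic used as the efficiency metric. The tensor shapes, vocabulary size, and blank index are illustrative assumptions; LF-MMI is omitted because it additionally requires a denominator graph, typically built with toolkits such as k2 or pychain.

```python
# Minimal sketch of the CTC and RNN-T training criteria in PyTorch/torchaudio.
# All shapes and the vocabulary size below are illustrative assumptions.
import torch
import torchaudio.functional as F

B, T, U, V = 4, 100, 20, 32  # batch, encoder frames, target length, vocab size (incl. blank)

targets = torch.randint(1, V, (B, U), dtype=torch.long)   # labels in [1, V); 0 is blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

# CTC: per-frame log-probabilities over the vocabulary, conditionally
# independent across frames given the encoder output.
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)      # (T, B, V)
ctc_loss = torch.nn.CTCLoss(blank=0)(
    log_probs, targets, input_lengths, target_lengths
)

# RNN-T: a joint network scores every (frame, target-position) pair, so the
# logits carry an extra target-length axis; torchaudio applies the
# log-softmax internally.
joint_logits = torch.randn(B, T, U + 1, V)                # (B, T, U+1, V)
rnnt_loss = F.rnnt_loss(
    joint_logits,
    targets.to(torch.int32),
    input_lengths.to(torch.int32),
    target_lengths.to(torch.int32),
    blank=0,
)

# Real-time factor (RTF): decoding wall-clock time divided by audio duration;
# RTF < 1 means the system decodes faster than real time.
rtf = 2.5 / 10.0   # e.g., 2.5 s to decode 10 s of audio -> RTF = 0.25
```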
