Tutorial: Foundational Problems in Neural Speech Recognition
Ehsan Variani, Georg Heigold, Ke Wu, Michael Riley
The first part of this talk focuses on the mathematical modeling of the existing neural ASR criteria. We introduce a modular framework that can explain all the existing criteria, such as Cross Entropy (CE), Connectionist Temporal Classification (CTC), Recurrent Neural Network Transducer (RNN-T), Hybrid Autoregressive Transducer (HAT) and Listen, Attend and Spell (LAS). We also introduce the LAttice-based Speech Transducer library (LAST), which provides efficient implementations of these criteria and allows the user to mix and match different components to create new training criteria. A simple Colab notebook is presented to engage the audience in using LAST to implement a simple ASR model on a digit recognition task.
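These criteria differ mainly in how they factor the probability of a label sequence over an alignment lattice. As a concrete reference point, the sketch below computes the CTC case, which sums over all blank-augmented frame alignments with a forward recursion (a minimal NumPy illustration, not part of the LAST API):

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """Negative log CTC probability of `labels` for one utterance.

    log_probs: (T, V) per-frame log posteriors, blank at index `blank`.
    labels:    target label ids, without blanks (assumed non-empty).
    """
    T, V = log_probs.shape
    # Extended sequence: a blank before, between, and after every label.
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)      # forward log-probabilities
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]
            if s > 0:
                terms.append(alpha[t - 1, s - 1])
            # Skip transition only between two distinct non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(terms) + log_probs[t, ext[s]]

    # Valid alignments end in the last label or the trailing blank.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```

The other criteria can be seen as varying the same ingredients: RNN-T and HAT condition each frame's distribution on the previously emitted labels, and LAS replaces the explicit alignment lattice with attention.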
The second half of the talk focuses on some practical problems in ASR modeling and some principled solutions. The problems are:
Language model integration: this part focuses on principled ways of adding language models within the noisy-channel formulation of ASR. We introduce ways to estimate the internal language model of different ASR models and approaches to integrating external language models during first-pass decoding or second-pass rescoring.
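One common way to write this combination is a log-linear score in which the model's internal language model is discounted before an external language model is added (a schematic formula; the choice of estimator and interpolation weights is part of the tutorial material):

$$\text{score}(y \mid x) \;=\; \log P_{\text{ASR}}(y \mid x) \;-\; \lambda_{\text{ILM}} \log P_{\text{ILM}}(y) \;+\; \lambda_{\text{ELM}} \log P_{\text{ELM}}(y)$$

This follows from Bayes' rule, $P(y \mid x) \propto P(x \mid y)\, P(y)$, when the likelihood term is approximated by $P_{\text{ASR}}(y \mid x) / P_{\text{ILM}}(y)$ and the prior $P(y)$ by the external language model.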
Streaming ASR: we explain the main theoretical reason why streaming ASR models perform much worse than their non-streaming counterparts and present two solutions. The main focus will be on the label bias problem and how the local normalization assumption in the existing ASR training criteria intensifies it. Finally, we also present a way to measure modeling latency and how to optimize models in this respect.
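To make the label bias point concrete, contrast the two normalization choices schematically (illustrative notation, not the tutorial's exact formulation). A locally normalized streaming model factors the sequence probability as

$$P(y \mid x) \;=\; \prod_{u} P\big(y_u \mid y_{<u},\, x_{\le t_u}\big),$$

so every per-step factor must sum to one over the vocabulary using only the audio seen up to time $t_u$, no matter what later frames reveal. A globally normalized model,

$$P(y \mid x) \;=\; \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')},$$

normalizes once over complete hypotheses, which lets later evidence down-weight early, partially informed decisions.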
Time alignment: how to improve the time alignment of ASR models is the main question this section tries to answer, along with how the solution can lead to simpler ASR decoding methods.
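As a small illustration of why a well-aligned model admits simpler decoding: when the per-frame posteriors are sharply peaked at the right frames, a greedy frame-wise decoder is often enough (a toy sketch, not the tutorial's proposed method):

```python
import numpy as np

def greedy_decode(log_probs, blank=0):
    """Frame-wise argmax decoding: collapse repeats, then drop blanks.

    Adequate only when posteriors are peaked and well aligned in time;
    otherwise lattice or beam-search decoding is still required.
    """
    best = np.argmax(log_probs, axis=-1)              # best label per frame
    collapsed = [int(l) for i, l in enumerate(best)
                 if i == 0 or l != best[i - 1]]       # merge repeated frames
    return [l for l in collapsed if l != blank]       # remove blank symbols
```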
Speech representation: what features can be extracted from an ASR system for downstream tasks while preserving the following properties: A) Back-propagation: the downstream model can fine-tune the upstream ASR model if paired data exist; B) Robustness: changing the upstream ASR system does not require retraining the downstream model. We will present several speech representations with these properties.
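A toy JAX sketch of property A (hypothetical upstream and downstream functions, not a LAST interface): if the downstream task consumes the upstream model's per-frame label posteriors, gradients of the downstream loss flow back into the upstream parameters, and because those posteriors live over a fixed label vocabulary, swapping the upstream model does not change the downstream input space (property B).

```python
import jax

def upstream_posteriors(up_params, frames):
    # Hypothetical one-layer "ASR" front end: per-frame label posteriors.
    logits = frames @ up_params["w"] + up_params["b"]          # (T, V)
    return jax.nn.softmax(logits, axis=-1)

def downstream_logits(down_params, posteriors):
    # Downstream classifier over a pooled summary of the posteriors.
    pooled = posteriors.mean(axis=0)                           # (V,)
    return pooled @ down_params["w"] + down_params["b"]        # (classes,)

def joint_loss(params, frames, task_label):
    post = upstream_posteriors(params["up"], frames)
    logp = jax.nn.log_softmax(downstream_logits(params["down"], post))
    return -logp[task_label]

# Gradients flow into both parameter sets, so paired downstream data can
# fine-tune the upstream ASR model end to end (property A).
grad_fn = jax.grad(joint_loss)
```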
Semi-supervised training: how to extend the supervised training criteria to take advantage of unlabeled speech and text data. We show a detailed formulation of the semi-supervised criteria and present several experimental results.
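Schematically, such a composite objective adds weighted terms for unpaired speech and unpaired text to the supervised loss (an illustrative form; the exact criteria and weights are developed in the tutorial):

$$\mathcal{L} \;=\; \mathcal{L}_{\text{sup}}(\mathcal{D}_{\text{paired}}) \;+\; \lambda_{s}\, \mathcal{L}_{\text{speech}}(\mathcal{D}_{\text{speech}}) \;+\; \lambda_{t}\, \mathcal{L}_{\text{text}}(\mathcal{D}_{\text{text}}),$$

where the speech-only term can, for example, score pseudo-labels under the same lattice-based criteria, and the text-only term trains the model's internal language model.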
For all the problems above, the audience will have a chance to use the LAST library and the Colab notebook to evaluate the effectiveness of the solutions themselves during the tutorial.