11 May 2022

Maximum Likelihood Estimation (MLE) is currently the most common approach to training large-scale speech recognition systems. While it has significant practical advantages, MLE exhibits several drawbacks known in the literature: training and inference conditions are mismatched, and a proxy objective is optimized instead of the word error rate. Recently, the Optimal Completion Distillation (OCD) training method was proposed, which attempts to address some of these issues. In this paper, we analyze whether the method is competitive with a strong MLE baseline and investigate its scalability towards large speech data beyond read speech, which, to our knowledge, is the first such attempt in the literature. In addition, we propose and analyze several sampling strategies that trade off exploration and exploitation of unseen prefixes, and study their effect on ASR accuracy. We conduct several experiments on both public LibriSpeech data and in-house large-scale far-field data and compare models trained with MLE and OCD. Our proposed greedy sampling with soft targets proves most effective and yields a 9% relative word error rate improvement over i.i.d. sampling. Finally, we note that the OCD method improves over MLE without label smoothing by 12%, but underperforms it by 6% once label smoothing is introduced to MLE.
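To make the OCD objective concrete, the following is a minimal sketch of how soft targets can be built for a greedily sampled hypothesis, assuming a generic PyTorch encoder-decoder ASR model; the function names (`ocd_targets`, `ocd_loss`), the `EOS` token id, and the tensor shapes are illustrative assumptions, not the implementation used in this paper.

```python
# Sketch of OCD training targets under greedy sampling (illustrative only).
import torch
import torch.nn.functional as F

EOS = 0  # hypothetical end-of-sentence token id


def ocd_targets(hyp, ref, vocab_size):
    """Soft target distribution for every prefix of the sampled hypothesis:
    uniform over the tokens that extend the prefix towards a completion with
    minimum edit distance to the reference."""
    T, N = len(hyp), len(ref)
    # d[t][j] = edit distance between hyp[:t] and ref[:j]
    d = [[0] * (N + 1) for _ in range(T + 1)]
    for j in range(N + 1):
        d[0][j] = j
    for t in range(1, T + 1):
        d[t][0] = t
        for j in range(1, N + 1):
            sub = d[t - 1][j - 1] + (hyp[t - 1] != ref[j - 1])
            d[t][j] = min(sub, d[t - 1][j] + 1, d[t][j - 1] + 1)

    targets = torch.zeros(T + 1, vocab_size)
    for t in range(T + 1):
        best = min(d[t])
        optimal = set()
        for j in range(N + 1):
            if d[t][j] == best:
                # continuing with ref[j] keeps the completion optimal;
                # once the full reference is matched, EOS is optimal
                optimal.add(ref[j] if j < N else EOS)
        targets[t, list(optimal)] = 1.0 / len(optimal)
    return targets


def ocd_loss(decoder_logits, hyp, ref):
    """Cross-entropy between the model's next-token distributions along the
    greedily sampled hypothesis (one step per prefix, including the empty
    prefix and the final EOS step) and the OCD soft targets."""
    targets = ocd_targets(hyp, ref, decoder_logits.size(-1))
    log_probs = F.log_softmax(decoder_logits, dim=-1)  # (len(hyp) + 1, vocab)
    return -(targets * log_probs).sum(dim=-1).mean()
```

In this sketch the target at each prefix is uniform over all next tokens that keep the completion at minimum edit distance to the reference, which is one way to realize the "soft targets" contrasted with the single ground-truth token used by the MLE objective.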
