BEING GREEDY DOES NOT HURT: SAMPLING STRATEGIES FOR END-TO-END SPEECH RECOGNITION
Jahn Heymann, Egor Lakomkin, Leif Raedel
Maximum Likelihood Estimation (MLE) is currently the most common approach to train large scale speech recognition systems. While it has significant practical advantages, MLE exhibits several drawbacks known in the literature: training and inference conditions are mismatched, and a proxy objective is optimized instead of the word error rate. Recently, the Optimal Completion Distillation (OCD) training method was proposed, which attempts to address some of those issues. In this paper, we analyze whether the method is competitive with a strong MLE baseline and investigate its scalability to large speech data beyond read speech, which to our knowledge is the first such attempt in the literature. In addition, we propose and analyze several sampling strategies that trade off exploration and exploitation of unseen prefixes, and study their effect on ASR accuracy. We conduct several experiments on both the public LibriSpeech data and large-scale in-house far-field data, and compare models trained with MLE and OCD. Our proposed greedy sampling with soft targets proves most effective and yields a 9% relative word error rate improvement over i.i.d. sampling. Finally, we note that OCD improves over MLE without label smoothing by 12%, but underperforms by 6% once label smoothing is introduced to MLE.
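To make the mechanics behind "greedy sampling with soft targets" concrete, the following is a minimal sketch of OCD-style training, not the authors' implementation. The decoder interface `model.step(enc, prefix)` returning next-token logits and the `sos_id`/`eos_id` arguments are hypothetical placeholders; the optimal-token computation follows the standard Levenshtein dynamic program from the OCD formulation, and the uniform distribution over optimal tokens is one simple choice of soft target.

```python
import torch
import torch.nn.functional as F


def optimal_next_tokens(prefix, target, eos_id):
    """For each step t, return the tokens extending prefix[:t] for which some
    completion still reaches the minimal edit distance to `target`."""
    m = len(target)
    row = list(range(m + 1))                     # edit distances for the empty prefix
    per_step = []
    for t in range(len(prefix)):
        best = min(row)
        opt = {target[j] if j < m else eos_id    # continue the reference from j,
               for j, d in enumerate(row)        # or stop if it is fully matched
               if d == best}
        per_step.append(sorted(opt))
        # advance the Levenshtein DP by one sampled token
        new_row = [row[0] + 1]
        for j in range(1, m + 1):
            cost = 0 if prefix[t] == target[j - 1] else 1
            new_row.append(min(new_row[j - 1] + 1, row[j] + 1, row[j - 1] + cost))
        row = new_row
    return per_step


def greedy_prefix(model, enc, sos_id, eos_id, max_len=200):
    """Draw a prefix from the model itself with greedy (argmax) decoding;
    swapping argmax for multinomial sampling gives the i.i.d. variant."""
    tokens = [sos_id]
    for _ in range(max_len):
        logits = model.step(enc, tokens)         # hypothetical decoder call
        tokens.append(int(logits.argmax(-1)))
        if tokens[-1] == eos_id:
            break
    return tokens[1:]                            # drop the start symbol


def ocd_step(model, enc, reference, vocab_size, sos_id, eos_id):
    """One training step: sample a prefix from the model, build soft targets
    uniform over the optimal next tokens, and minimize the per-step KL."""
    with torch.no_grad():
        sampled = greedy_prefix(model, enc, sos_id, eos_id)
    loss = 0.0
    for t, opt in enumerate(optimal_next_tokens(sampled, reference, eos_id)):
        logits = model.step(enc, [sos_id] + sampled[:t])
        soft = torch.zeros(vocab_size)
        soft[opt] = 1.0 / len(opt)               # soft target over optimal tokens
        loss = loss + F.kl_div(F.log_softmax(logits, -1), soft, reduction="sum")
    return loss / max(len(sampled), 1)
```

In this sketch the only difference between the sampling strategies compared in the paper is how `greedy_prefix` picks the next token: taking the argmax exploits the model's current best hypothesis, while drawing from the full softmax explores a wider set of unseen prefixes.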