Data-Filtering Methods For Self-Training Of Automatic Speech Recognition Systems
Alexandru-Lucian Georgescu, Cristian Manolache, Dan Oneață, Horia Cucu, Corneliu Burileanu
Self-training is a simple and efficient way of leveraging unlabeled speech data: (i) start with a seed system trained on transcribed speech; (ii) pass the unlabeled data through this seed system to automatically generate transcriptions; (iii) enlarge the initial dataset with the self-labeled data and retrain the speech recognition system. However, to avoid polluting the augmented dataset with incorrect transcriptions, an important intermediate step is to select those parts of the self-labeled data whose transcriptions are accurate. Several approaches have been proposed in the community, but most prior works address only a single method. In contrast, in this paper we inspect three distinct classes of data filtering for self-training, leveraging: (i) confidence scores, (ii) multiple ASR hypotheses, and (iii) approximate transcriptions. We evaluate these approaches from two perspectives: the quantity versus the quality of the selected data, and the improvement of the seed ASR system obtained by including this data. The proposed methodology achieves state-of-the-art results on Romanian speech, obtaining a 25% relative improvement over prior work. Among the three methods, approximate transcriptions bring the highest performance gain, even though they yield the smallest quantity of data.
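To make the three-step recipe and the filtering stage concrete, here is a minimal Python sketch of one self-training round with two of the filter classes: confidence-score selection and agreement between multiple ASR hypotheses. Every identifier below (`decode`, `retrain`, `Utterance`, the 0.9 threshold) is a hypothetical placeholder, not the implementation evaluated in the paper.

```python
# Minimal sketch of self-training with data filtering. All APIs here
# (decode, retrain) are assumed for illustration only.
from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str
    transcript: str = ""  # empty for unlabeled data

def confidence_filter(seed_asr, unlabeled, threshold=0.9):
    """Keep 1-best hypotheses whose utterance-level confidence is high.
    Assumes decode() returns (hypothesis, confidence in [0, 1])."""
    selected = []
    for utt in unlabeled:
        hypothesis, confidence = seed_asr.decode(utt.audio_path)
        if confidence >= threshold:
            selected.append(Utterance(utt.audio_path, hypothesis))
    return selected

def agreement_filter(asr_a, asr_b, unlabeled):
    """Keep utterances where two different systems produce identical
    hypotheses -- a simple instance of multi-hypothesis filtering."""
    selected = []
    for utt in unlabeled:
        hyp_a, _ = asr_a.decode(utt.audio_path)
        hyp_b, _ = asr_b.decode(utt.audio_path)
        if hyp_a == hyp_b:
            selected.append(Utterance(utt.audio_path, hyp_a))
    return selected

def self_train(seed_asr, labeled, unlabeled):
    """One round of self-training: pseudo-label, filter, retrain."""
    selected = confidence_filter(seed_asr, unlabeled)
    return seed_asr.retrain(labeled + selected)  # enlarged dataset
```

In a sketch like this, the confidence threshold (or the strictness of the agreement check) directly trades data quantity for quality, which is exactly the trade-off the paper measures for each of the three filtering methods.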