Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages
Kaushal Bhogale (Indian Institute of Technology, Madras); Abhigyan Raman (AI4Bharat); Tahir Javed (Indian Institute of Technology Madras); Sumanth Doddapaneni (Robert Bosch Centre for Data Science and AI); Anoop Kunchukuttan (Microsoft); Pratyush Kumar (Indian Institute of Technology Madras); Mitesh M. Khapra (Indian Institute of Technology Madras)
Collecting labelled datasets for speech recognition systems for low-resource languages across a diverse set of domains and speakers is expensive. In this work, we demonstrate an inexpensive and effective alternative: 'mining' text and audio pairs for Indian languages from public sources, specifically the public archives of All India Radio. As a key component, we adapt the Needleman-Wunsch algorithm to align sentences with corresponding audio segments given a long audio file and a PDF of its transcript, while remaining robust to large errors due to OCR, extraneous text, and non-transcribed speech. We thus create Shrutilipi, a dataset containing over 6,400 hours of labelled audio across 12 Indian languages, totalling 3.3M sentences. We establish the quality of Shrutilipi with 21 human evaluators across the 12 languages. We also establish the diversity of Shrutilipi in terms of represented regions, speakers, and mentioned named entities. Significantly, we show that adding Shrutilipi to the training data of ASR systems improves accuracy for both Wav2Vec and Conformer model architectures for 7 languages across benchmarks.
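The alignment step described above builds on the classic Needleman-Wunsch global alignment algorithm. The paper's adaptation (tolerant to OCR errors and untranscribed speech) is not specified here, so the following is only a minimal sketch of standard Needleman-Wunsch on two token sequences; the function name, scoring values, and the interpretation of gaps as extraneous text or non-transcribed segments are illustrative assumptions, not the authors' implementation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (score, aligned pairs).

    A pair (x, None) is an element of `a` with no counterpart in `b`
    (e.g. extraneous transcript text); (None, y) is the reverse
    (e.g. non-transcribed speech). Scoring values are illustrative.
    """
    n, m = len(a), len(b)
    # Dynamic-programming table of best alignment scores for prefixes.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Traceback from the bottom-right cell to recover the alignment.
    i, j, pairs = n, m, []
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    while i > 0:
        pairs.append((a[i - 1], None))
        i -= 1
    while j > 0:
        pairs.append((None, b[j - 1]))
        j -= 1
    return score[n][m], pairs[::-1]
```

In the paper's setting, `a` would be sentences from the OCR'd transcript and `b` candidate audio segments, with the match score replaced by a text-similarity measure between the transcript sentence and a rough decoding of the segment; that substitution is what makes the alignment robust to the noise sources the abstract lists.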