
Investigation Into Phone-Based Subword Units for Multilingual End-to-End Speech Recognition

Saierdaer Yusuyin (Xinjiang University); Hao Huang (Xinjiang University); Junhua Liu (University of Science and Technology of China); Cong Liu (iFLYTEK Research)

07 Jun 2023

Multilingual automatic speech recognition (ASR) models that use phones as modeling units have improved greatly in low-resource and similar-language scenarios, benefiting from representations shared across languages. Meanwhile, subwords have proven effective as modeling units for monolingual end-to-end recognition systems. In this paper, we investigate phone-based subwords, specifically Byte Pair Encoding (BPE), as modeling units for multilingual end-to-end speech recognition. To explore the potential of phone-based BPE (PBPE) for multilingual ASR, we first compare three multilingual BPE training methods on similar low-resource languages of Central Asia. Then, by adding three high-resource European languages to the experiments, we analyze the degree of cross-lingual sharing in similar-language and low-resource scenarios. Finally, we propose a method to adjust the bigram statistics in the BPE algorithm and show that the PBPE representation leads to accuracy improvements in multilingual scenarios. The experiments show that PBPE outperforms phones, characters, and character-based BPE as output units. In particular, the best multilingual PBPE model achieves a 25% relative improvement on a low-resource language compared to a character-based BPE system.
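For background, the sketch below shows the basic BPE merge loop applied to phone sequences rather than characters: count adjacent bigrams across the corpus, merge the most frequent pair into a new unit, and repeat. This is the standard procedure whose bigram statistics the paper proposes to adjust, not the authors' modified algorithm; the function name, corpus format, and the "+" joining convention are illustrative assumptions.

    from collections import Counter

    def learn_phone_bpe(corpus, num_merges):
        """Learn BPE merge rules over phone sequences.

        corpus: list of utterances, each a list of phone symbols,
                e.g. [["s", "p", "iy", "ch"], ...]
        Returns the learned merge pairs, in the order they were applied.
        """
        seqs = [tuple(u) for u in corpus]
        merges = []
        for _ in range(num_merges):
            # Count adjacent bigrams of current units across the corpus.
            pairs = Counter()
            for seq in seqs:
                for a, b in zip(seq, seq[1:]):
                    pairs[(a, b)] += 1
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent bigram
            merges.append(best)
            # Replace every occurrence of the best pair with one merged unit.
            merged_symbol = "+".join(best)
            new_seqs = []
            for seq in seqs:
                out, i = [], 0
                while i < len(seq):
                    if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                        out.append(merged_symbol)
                        i += 2
                    else:
                        out.append(seq[i])
                        i += 1
                new_seqs.append(tuple(out))
            seqs = new_seqs
        return merges

Running the loop on phone transcripts yields multi-phone units (e.g. "iy+ch" in the hypothetical example above), which can then serve as the output vocabulary of an end-to-end model in place of raw phones or character-based subwords.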
