SQuId: Measuring Speech Naturalness in Many Languages

Thibault Sellam (Google); Ankur Bapna (Google Research); Joshua Camp (Google); Diana Mackinnon (Google); Ankur Parikh (Google); Jason Riesa (Google)

06 Jun 2023

Much of text-to-speech research relies on human evaluation. This incurs heavy costs and slows down the development process, especially in heavily multilingual applications where recruiting and polling annotators can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, the largest effort of this type to date. The main insight is that training one model on many locales consistently surpasses mono-locale baselines. We show that the model outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, for which there is no fine-tuning data. We also highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effect of model size and pre-training diversity with ablation experiments.

Tags:

Machine learning methods for language

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
