Ensemble of Deep Neural Network Models for MOS Prediction

Marie Kunešová (University of West Bohemia); Jindrich Matousek (University of West Bohemia, Pilsen, Czech Republic); Jan Lehečka (University of West Bohemia); Jan Svec (University of West Bohemia); Josef Michalek (University of West Bohemia); Daniel Tihelka (University of West Bohemia); Martin Bulin (University of West Bohemia); Zdenek Hanzlicek (University of West Bohemia); Marketa Rezackova (University of West Bohemia)

06 Jun 2023

Automatic evaluation of synthetic speech quality could serve as a cheaper and faster alternative to standard listening tests. In this paper, we present our contribution to this line of research: a system for automatic prediction of the mean opinion score (MOS) given by human listeners. The system was developed specifically for the recent VoiceMOS Challenge. Following the success of fusion systems in similar challenges, our contribution is an ensemble that interpolates the outputs of seven different models: four different wav2vec models, a CNN-RNN model, QuartzNet, and the LDNet baseline. In the VoiceMOS Challenge, our system achieved the second-best utterance-level MSE of 0.171 and placed between 2nd and 8th among all 22 participating teams on the other evaluation metrics.
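As a rough illustration of the two ideas named in the abstract, interpolating the outputs of several models and scoring the result by utterance-level MSE, the following minimal sketch may help. It is not the authors' implementation: the abstract does not say how the interpolation weights were obtained, so fixed weights are assumed here, and the function names (`interpolate`, `utterance_mse`), the three stand-in models, and all scores are hypothetical toy values.

```python
from typing import List, Sequence


def interpolate(predictions: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Fuse per-model MOS predictions by a normalized weighted average.

    predictions[m][i] is the score that model m assigns to utterance i.
    The weights are an assumption of this sketch; they are divided by
    their sum, so they do not need to add up to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    n_utts = len(predictions[0])
    if any(len(p) != n_utts for p in predictions):
        raise ValueError("all models must score the same utterances")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_utts)
    ]


def utterance_mse(predicted: Sequence[float],
                  target: Sequence[float]) -> float:
    """Mean squared error between predicted and listener MOS, per utterance."""
    if len(predicted) != len(target) or not target:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy data: three hypothetical models scoring two utterances.
model_scores = [
    [3.0, 4.0],  # stand-in for model A
    [3.5, 4.5],  # stand-in for model B
    [2.5, 3.5],  # stand-in for model C
]
toy_weights = [0.5, 0.25, 0.25]  # assumed, not taken from the paper
listener_mos = [3.0, 4.5]        # made-up ground-truth MOS values

fused = interpolate(model_scores, toy_weights)
error = utterance_mse(fused, listener_mos)
```

In practice the weights would presumably be tuned on held-out development data rather than fixed by hand; the sketch only shows the shape of the computation.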

Tags:

Audio and speech quality and intelligibility measures

Conference: IEEE ICASSP 2023, 4-10 June 2023, Greece (virtual and in-person)
