
Ensemble prosody prediction for expressive speech synthesis

Tian Huey Teh (Papercup); Vivian Hu (Papercup); Devang Mohan (Papercup); Zack Hodari (Papercup); Christopher Wallis (Papercup); Tomás Gómez Ibarrondo (Papercup); Alexandra Torresquintero (Papercup); James Leoni (Papercup); Mark Gales (University of Cambridge); Simon King (University of Edinburgh)

07 Jun 2023

Generating expressive speech with rich and varied prosody continues to be a challenge for Text-to-Speech. Most efforts have focused on sophisticated neural architectures intended to better model the data distribution. Yet, in evaluations it is generally found that no single model is preferred for all input texts. This suggests an approach that has rarely been used before for Text-to-Speech: an ensemble of models. We apply ensemble learning to prosody prediction. We construct simple ensembles of prosody predictors by varying either model architecture or model parameter values. To automatically select amongst the models in the ensemble when performing Text-to-Speech, we propose a novel, and computationally trivial, variance-based criterion. We demonstrate that even a small ensemble of prosody predictors yields useful diversity, which, combined with the proposed selection criterion, outperforms any individual model from the ensemble.
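The abstract describes selecting among ensemble members with a computationally trivial, variance-based criterion. The paper's exact criterion and prosody representation are not reproduced here, so the sketch below is only a hedged illustration of one plausible reading: each model predicts a prosody contour for the utterance, and the member whose prediction has the highest temporal variance (i.e., the least flat prosody) is chosen. The function name `select_prediction` and the single-contour representation are assumptions, not the authors' interface.

```python
import numpy as np

def select_prediction(predictions):
    """Hypothetical variance-based selection among ensemble members.

    predictions: list of 1-D arrays, one predicted prosody contour
    (e.g., per-phone F0 values) per ensemble member.
    Returns the index of the member whose contour has the highest
    temporal variance -- one possible reading of a "variance-based
    criterion"; the paper's actual criterion may differ.
    """
    variances = [float(np.var(p)) for p in predictions]
    return int(np.argmax(variances))

# Toy usage: three hypothetical contours over five phones.
preds = [
    np.full(5, 1.0),                        # flat prosody, variance 0
    np.array([0.0, 2.0, 0.0, 2.0, 0.0]),    # strongly varied contour
    np.linspace(0.0, 1.0, 5),               # gentle rising contour
]
chosen = select_prediction(preds)  # index of the most varied contour
```

Because the criterion reduces to computing a variance per member and taking an argmax, it adds negligible cost on top of running the ensemble, consistent with the abstract's "computationally trivial" claim.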
