PROSODY-AWARE SPEECHT5 FOR EXPRESSIVE NEURAL TTS

Yan Deng (Microsoft); Long Zhou (Microsoft Research Asia); Yuanhao Yi (Microsoft); Shujie Liu (Microsoft Research Asia); Lei He (Microsoft Cloud and AI)

06 Jun 2023

SpeechT5, a multimodal learning framework that explores encoder-decoder pre-training by leveraging both unlabeled speech and text, has proven effective on a wide variety of speech processing tasks. In this paper, we enhance SpeechT5 by adding a new sub-task on prosody modeling (prosody-aware SpeechT5) for neural text-to-speech (TTS), which improves the model's capability to learn richer contextual representations through multi-task learning. In the prosody-aware SpeechT5 training framework, most modules of a neural TTS system, including the encoder, decoder, and variance adaptor, can be pre-trained on large-scale unlabeled speech and text corpora. Experimental results show that the proposed prosody-aware SpeechT5 is effective at improving the expressiveness of neural TTS: 1) the CMOS (comparison mean opinion score) gain is 0.154 for texts from the news domain and 0.114 for texts from the audiobook domain; 2) prosody-related issues in synthetic speech are reduced by 19.02% in subjective evaluation.
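
To make the variance-adaptor component mentioned above concrete, the following is a minimal sketch (not the authors' released code) of a FastSpeech 2-style prosody/variance adaptor of the kind that could be pre-trained within such a framework. The hidden size, kernel size, and the choice of pitch, energy, and duration as prosody targets are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class VariancePredictor(nn.Module):
    """Predicts one scalar prosody value (e.g. pitch, energy, or log-duration)
    per encoder frame from the shared contextual representation."""

    def __init__(self, hidden: int = 256, kernel: int = 3, dropout: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden) contextual features from the shared encoder.
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(self.norm1(torch.relu(h)))
        h = self.conv2(h.transpose(1, 2)).transpose(1, 2)
        h = self.dropout(self.norm2(torch.relu(h)))
        return self.proj(h).squeeze(-1)  # (batch, time)


class ProsodyAdaptor(nn.Module):
    """Groups per-feature predictors so that prosody prediction can serve as
    an auxiliary (multi-task) objective alongside the main TTS loss."""

    def __init__(self, hidden: int = 256):
        super().__init__()
        self.pitch = VariancePredictor(hidden)
        self.energy = VariancePredictor(hidden)
        self.duration = VariancePredictor(hidden)

    def forward(self, x: torch.Tensor) -> dict:
        return {
            "pitch": self.pitch(x),
            "energy": self.energy(x),
            "duration": self.duration(x),  # typically predicted in log scale
        }


if __name__ == "__main__":
    feats = torch.randn(2, 50, 256)               # (batch, frames, hidden)
    preds = ProsodyAdaptor()(feats)
    print({k: v.shape for k, v in preds.items()})  # each: torch.Size([2, 50])
```

In a multi-task setup, the predicted pitch, energy, and duration would be compared against extracted ground-truth values (e.g. with an L1 or L2 loss) and that prosody loss added to the reconstruction objective, which is one plausible reading of the "prosody modeling sub-task" described in the abstract.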
