Improving Naturalness And Controllability Of Sequence-To-Sequence Speech Synthesis By Learning Local Prosody Representations

Cheng Gong, Longbiao Wang, Zhenhua Ling, Shaotong Guo, Ju Zhang, Jianwu Dang

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:10:09

08 Jun 2021

State-of-the-art neural text-to-speech (TTS) networks are trained with a large amount of speech data, which significantly improves the quality of synthetic speech compared with traditional approaches. However, the prosody and controllability of the generated speech is still insufficient, especially in tonal languages. Moreover, the generated prosody is solely defined by the input text, which does not allow for different styles for the same sentence or words. In this study, we extended Tacotron2 with a pitch prediction task to capture discrete pitch-related representations. Specifically, the learned pitch-related suprasegmental information is fed simultaneously with traditional character features into the decoder to generate final Mel spectrogram. Experiments show that the proposed method can improve the quality of the generated speech (mean opinion score of 4.37 vs. 4.22). Moreover, we demonstrated that we can easily achieve word-level pitch control during generation by changing local pitch-related representations before passing them to the decoder network.

Chairs:

Yu Zhang

Tags:

signal processing society

IEEE icassp 2021

virtual conference

2021

sps

virtual conference icassp 2021

june 6-11 2021

icassp 2021