INTERACTIVE MULTI-LEVEL PROSODY CONTROL FOR EXPRESSIVE SPEECH SYNTHESIS
Tobias Cornille, Jessa Bekker, Fengna Wang
Recent neural text-to-speech (TTS) models can produce highly natural speech. To synthesize expressive speech, however, prosody must be modeled and then predicted or controlled during synthesis, and intuitive control over prosody remains elusive. Some techniques only allow control over the global style of the speech and offer no fine-grained adjustments; others create fine-grained prosody embeddings, but these are difficult to manipulate into a desired speaking style. We therefore present ConEx, a novel model for expressive speech synthesis that can produce speech in a given speaking style while also allowing local adjustments to the prosody of the generated speech. The model builds on the non-autoregressive FastSpeech architecture and adds a reference encoder to learn global prosody embeddings and a vector-quantized variational autoencoder (VQ-VAE) to create fine-grained prosody embeddings. To make prosody manipulation practical, we propose a new interactive editing method. Experiments on two datasets show that the model enables multi-level prosody control.
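The abstract describes two levels of conditioning (a global reference embedding plus a sequence of quantized local prosody codes) without implementation detail. The following PyTorch sketch is only an illustration of that idea, not the authors' code: the module names (GlobalReferenceEncoder, FineGrainedVQEncoder, MultiLevelProsodyTTS), all dimensions, and the simplification that the reference mel-spectrogram is pre-aligned with the phoneme sequence are assumptions; the actual ConEx model builds on FastSpeech's full encoder, length regulator, and variance prediction pipeline.

```python
import torch
import torch.nn as nn


class GlobalReferenceEncoder(nn.Module):
    """Pools a reference mel-spectrogram into one utterance-level prosody vector."""

    def __init__(self, n_mels: int = 80, dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, frames, n_mels) -> global style embedding (batch, dim)
        _, h = self.rnn(ref_mel)
        return h.squeeze(0)


class FineGrainedVQEncoder(nn.Module):
    """Quantizes local prosody features into discrete codebook entries (VQ-VAE style)."""

    def __init__(self, n_mels: int = 80, dim: int = 128, codebook_size: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, mel: torch.Tensor):
        z = self.proj(mel)  # continuous local prosody features, (batch, T, dim)
        # Nearest codebook entry per position; editing `codes` at chosen
        # positions is what makes local prosody adjustment possible.
        book = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        codes = torch.cdist(z, book).argmin(-1)  # (batch, T)
        q = self.codebook(codes)
        q = z + (q - z).detach()  # straight-through gradient estimator
        return q, codes


class MultiLevelProsodyTTS(nn.Module):
    """Non-autoregressive (FastSpeech-style) decoder conditioned on both prosody levels."""

    def __init__(self, vocab_size: int = 100, dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        self.global_enc = GlobalReferenceEncoder(n_mels, dim)
        self.local_enc = FineGrainedVQEncoder(n_mels, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phonemes: torch.Tensor, ref_mel: torch.Tensor):
        # phonemes: (batch, T); ref_mel: (batch, T, n_mels). For simplicity the
        # reference is assumed to be aligned to the phoneme sequence already.
        g = self.global_enc(ref_mel).unsqueeze(1)  # (batch, 1, dim), broadcast over time
        q, codes = self.local_enc(ref_mel)         # (batch, T, dim), per-position prosody
        h = self.text_emb(phonemes) + g + q
        return self.to_mel(self.decoder(h)), codes


if __name__ == "__main__":
    model = MultiLevelProsodyTTS()
    phonemes = torch.randint(0, 100, (2, 17))
    ref_mel = torch.randn(2, 17, 80)
    mel_out, codes = model(phonemes, ref_mel)
    print(mel_out.shape, codes.shape)  # torch.Size([2, 17, 80]) torch.Size([2, 17])
```

In this reading, the global style is changed by swapping the single reference embedding g, while local edits replace individual entries of codes before decoding; together these correspond to the multi-level prosody control the abstract claims.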