13 May 2022

Recent neural text-to-speech (TTS) models can produce highly natural speech. To synthesize expressive speech, the prosody of the speech must be modeled and then predicted or controlled during synthesis. However, intuitive control over prosody remains elusive. Some techniques only allow control over the global style of the speech and do not permit fine-grained adjustments; others create fine-grained prosody embeddings, but these are difficult to manipulate to obtain a desired speaking style. We therefore present ConEx, a novel model for expressive speech synthesis that can produce speech in a given speaking style while also allowing local adjustments to the prosody of the generated speech. The model builds on the non-autoregressive FastSpeech architecture and adds a reference encoder to learn global prosody embeddings and a vector-quantized variational autoencoder (VQ-VAE) to create fine-grained prosody embeddings. To enable prosody manipulation, a new interactive method is proposed. Experiments on two datasets show that the model enables multi-level prosody control.
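
The abstract names two prosody pathways without giving an implementation. The following is a minimal, hypothetical PyTorch sketch of those two components only, not the authors' code: a `ReferenceEncoder` that condenses a reference mel-spectrogram into one global style embedding per utterance, and a `VectorQuantizer` that snaps frame-level prosody features to discrete codebook entries, as a VQ-VAE bottleneck would. All module names, layer choices, and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of the two prosody pathways described in the abstract.
# Sizes (n_mels=80, dim=128, num_codes=256) are illustrative assumptions.
import torch
import torch.nn as nn


class ReferenceEncoder(nn.Module):
    """Summarizes a reference mel-spectrogram into one global prosody embedding."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mel):                           # mel: (B, T, n_mels)
        _, h = self.rnn(mel)                          # h: (1, B, dim)
        return h.squeeze(0)                           # (B, dim) global style


class VectorQuantizer(nn.Module):
    """VQ-VAE codebook: maps each frame-level prosody vector to its nearest code."""
    def __init__(self, num_codes=256, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                             # z: (B, T, dim)
        flat = z.reshape(-1, z.size(-1))              # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)  # distance to each code
        idx = dists.argmin(dim=-1)                    # nearest code per frame
        q = self.codebook(idx).view_as(z)             # quantized vectors
        q = z + (q - z).detach()                      # straight-through gradient
        return q, idx.view(z.shape[:2])               # discrete codes, editable per frame


# Usage: global embedding from a reference clip, discrete codes per prosody frame.
enc, vq = ReferenceEncoder(), VectorQuantizer()
mel = torch.randn(2, 120, 80)                         # batch of reference spectrograms
g = enc(mel)                                          # (2, 128) global style embedding
z = torch.randn(2, 40, 128)                           # frame-level prosody features
q, codes = vq(z)                                      # quantized features + code indices
```

In a design of this kind, the discrete per-frame codes are what make local, interactive edits conceivable: swapping a single code index changes prosody only in that region, while the global embedding sets the overall speaking style.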
