SPS
04 May 2020

Most recent emotional speech synthesizers are trained on large amounts of data: they require a sufficient number of recordings for each emotion and each speaker. Acquiring emotional speech is more expensive than acquiring neutral speech, because expressing natural emotion in a voice recording environment requires professional acting ability. It would therefore be economical and beneficial to transfer convincing emotional prosody to a neutral voice. We demonstrate that our system can learn emotional speech from the emotional recordings of multiple speakers and transfer that emotional prosody to the voice of a speaker who provides only neutral speech.

Our system is a neural network architecture that synthesizes speech directly from text together with emotion and speaker identifiers. It consists of two main components: a modified Tacotron 2 and an unmodified WaveGlow. Tacotron 2 is a recurrent sequence-to-sequence network that maps character embeddings to mel-spectrograms; WaveGlow is a vocoder that synthesizes time-domain waveforms from those spectrograms. The modified Tacotron 2 is trained to synthesize speech conditioned on emotion and speaker by injecting emotion and speaker encodings into the decoder of Tacotron 2. This allows the system to synthesize emotional speech not only for the speakers with emotional recordings but also for a speaker without any.

In this demo, the audience can interactively enter any sentence into the speech synthesis system. In addition, Speech Synthesis Markup Language (SSML) has been incorporated to easily control the prosody of the input text. Through SSML, the audience can select the emotion and the speaker, and can fine-tune three basic components, rate, volume, and pitch, at the character level.
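The conditioning step can be pictured as follows. This is a minimal, hypothetical sketch, not the authors' implementation: it assumes the emotion and speaker identifiers are looked up in small embedding tables whose vectors are concatenated onto every decoder input frame; the table contents and dimensions are made up for illustration.

```python
# Toy embedding tables; in the real system these would be learned vectors.
EMOTION_EMBED = {"neutral": [0.0, 0.0], "happy": [1.0, 0.0], "sad": [0.0, 1.0]}
SPEAKER_EMBED = {"spk0": [0.5], "spk1": [-0.5]}

def condition_decoder_inputs(frames, emotion, speaker):
    """Append the emotion and speaker embeddings to each decoder input frame."""
    e = EMOTION_EMBED[emotion]
    s = SPEAKER_EMBED[speaker]
    return [frame + e + s for frame in frames]

frames = [[0.1, 0.2], [0.3, 0.4]]  # two toy 2-dim mel frames
cond = condition_decoder_inputs(frames, "happy", "spk1")
# Each frame grows from 2 dims to 2 + 2 + 1 = 5 dims; because the same
# emotion vector is used for every speaker, an emotion learned from one
# speaker's recordings can be applied to another speaker's voice.
```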
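An SSML request to such a system might look like the snippet below. The `<prosody>` element with `rate`, `volume`, and `pitch` attributes is standard SSML; the `emotion` and `speaker` attributes are hypothetical stand-ins, since the abstract does not specify the exact names of the demo's extensions.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML: standard prosody controls plus assumed emotion/speaker
# attributes (the demo's actual extension syntax is not given in the abstract).
ssml = (
    '<speak emotion="happy" speaker="spk1">'
    'Hello <prosody rate="slow" volume="loud" pitch="high">world</prosody>!'
    '</speak>'
)

root = ET.fromstring(ssml)
prosody = root.find("prosody")
controls = {k: prosody.get(k) for k in ("rate", "volume", "pitch")}
# controls now holds the per-span fine-tuning parameters, while the
# root-level attributes select the emotion and the speaker.
```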
The position of each spoken character in the mel-spectrogram is estimated from the attention weights, i.e., from which characters are most strongly attended to while each mel-spectrogram frame is generated.
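The estimation above can be sketched as follows, assuming access to the decoder's attention alignment matrix: for each character, take the frame index at which its attention weight peaks. The toy alignment values are invented for illustration.

```python
def char_positions(attention):
    """attention[t][c] = weight on character c while generating frame t.
    Returns, for each character, the frame index where it is most attended."""
    n_chars = len(attention[0])
    positions = []
    for c in range(n_chars):
        column = [attention[t][c] for t in range(len(attention))]
        positions.append(column.index(max(column)))
    return positions

# 4 frames x 3 characters, with a roughly monotonic alignment
attn = [
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
print(char_positions(attn))  # → [0, 2, 3]
```

These per-character frame positions are what make character-level control of rate, volume, and pitch possible.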