SPS
04 May 2020

Most recent emotional speech synthesizers are trained on large amounts of data: they require a sufficient number of recordings for each emotion and each speaker. Acquiring emotional speech is more expensive than acquiring neutral speech, because expressing natural emotion in a voice recording environment requires professional acting ability. It would therefore be economical and beneficial to transfer convincing emotional prosody to a neutral voice. We demonstrate that our system can learn emotional speech from the emotional recordings of multiple speakers and transfer that emotional prosody to the voice of a speaker who provides only neutral speech.

Our system is a neural network architecture that synthesizes speech directly from text together with emotion and speaker identifiers. It consists of two main components: a modified Tacotron 2 and an unmodified WaveGlow. Tacotron 2 is a recurrent sequence-to-sequence network that maps character embeddings to mel-spectrograms; WaveGlow is a vocoder that synthesizes time-domain waveforms from those spectrograms. The modified Tacotron 2 is trained to synthesize speech conditioned on emotion and speaker by injecting emotion and speaker encodings into the decoder of Tacotron 2. This allows the system to synthesize emotional speech not only for the speakers with emotional recordings but also for a speaker without any.

In this demo, the audience can interactively enter any sentence into the speech synthesis system. In addition, Speech Synthesis Markup Language (SSML) has been incorporated to easily control the prosody of the input text. Through SSML, the audience can select the emotion and the speaker, and can fine-tune three basic components, rate, volume, and pitch, at the character level.
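The conditioning step can be pictured as follows. This is a minimal, hypothetical sketch, not the authors' implementation: it assumes the emotion and speaker identifiers are looked up in small embedding tables whose vectors are concatenated onto every decoder input frame; the table contents and dimensions are made up for illustration.

```python
# Toy embedding tables; in the real system these would be learned vectors.
EMOTION_EMBED = {"neutral": [0.0, 0.0], "happy": [1.0, 0.0], "sad": [0.0, 1.0]}
SPEAKER_EMBED = {"spk0": [0.5], "spk1": [-0.5]}

def condition_decoder_inputs(frames, emotion, speaker):
    """Append the emotion and speaker embeddings to each decoder input frame."""
    e = EMOTION_EMBED[emotion]
    s = SPEAKER_EMBED[speaker]
    return [frame + e + s for frame in frames]

frames = [[0.1, 0.2], [0.3, 0.4]]  # two toy 2-dim mel frames
cond = condition_decoder_inputs(frames, "happy", "spk1")
# Each frame grows from 2 dims to 2 + 2 + 1 = 5 dims; because the same
# emotion vector is used for every speaker, an emotion learned from one
# speaker's recordings can be applied to another speaker's voice.
```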
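An SSML request to such a system might look like the snippet below. The `<prosody>` element with `rate`, `volume`, and `pitch` attributes is standard SSML; the `emotion` and `speaker` attributes are hypothetical stand-ins, since the abstract does not specify the exact names of the demo's extensions.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML: standard prosody controls plus assumed emotion/speaker
# attributes (the demo's actual extension syntax is not given in the abstract).
ssml = (
    '<speak emotion="happy" speaker="spk1">'
    'Hello <prosody rate="slow" volume="loud" pitch="high">world</prosody>!'
    '</speak>'
)

root = ET.fromstring(ssml)
prosody = root.find("prosody")
controls = {k: prosody.get(k) for k in ("rate", "volume", "pitch")}
# controls now holds the per-span fine-tuning parameters, while the
# root-level attributes select the emotion and the speaker.
```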
The position of each spoken character in the mel-spectrogram is estimated from the attention weights, i.e., from which characters are most strongly attended to while each mel-spectrogram frame is generated.
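The estimation above can be sketched as follows, assuming access to the decoder's attention alignment matrix: for each character, take the frame index at which its attention weight peaks. The toy alignment values are invented for illustration.

```python
def char_positions(attention):
    """attention[t][c] = weight on character c while generating frame t.
    Returns, for each character, the frame index where it is most attended."""
    n_chars = len(attention[0])
    positions = []
    for c in range(n_chars):
        column = [attention[t][c] for t in range(len(attention))]
        positions.append(column.index(max(column)))
    return positions

# 4 frames x 3 characters, with a roughly monotonic alignment
attn = [
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
print(char_positions(attn))  # → [0, 2, 3]
```

These per-character frame positions are what make character-level control of rate, volume, and pitch possible.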