Mellotron: Multispeaker Expressive Voice Synthesis By Conditioning On Rhythm, Pitch And Global Style Tokens

Rafael Valle, Jason Li, Ryan Prenger, Bryan Catanzaro

Length: 14:31

04 May 2020

Mellotron is a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data. By explicitly conditioning on rhythm and continuous pitch contours from an audio signal or music score, Mellotron can generate speech in a variety of styles, ranging from read speech to expressive speech, from slow drawls to rap, and from monotonous voice to singing voice. Unlike other methods, we train Mellotron using only read speech data, without alignments between text and audio. We evaluate our models on the LJSpeech and LibriTTS datasets. We provide F0 Frame Errors and synthesized samples that include style transfer from other speakers, singers, and styles not seen during training; procedural manipulation of rhythm and pitch; and choir synthesis.
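
The abstract reports F0 Frame Errors without defining the metric. As a rough illustration, the sketch below computes it under the common definition: the fraction of frames that have either a voicing decision error or, when both contours are voiced, a pitch deviation above a relative tolerance. The function name f0_frame_error, the 20% default tolerance, and the convention that 0 Hz marks an unvoiced frame are assumptions of this sketch, not details taken from the presentation.

```python
def f0_frame_error(f0_ref, f0_est, tolerance=0.2):
    """Return the fraction of frames with a voicing or gross pitch error.

    f0_ref and f0_est are equal-length sequences of F0 values in Hz.
    Assumption of this sketch: a value of 0 marks an unvoiced frame.
    """
    if len(f0_ref) != len(f0_est):
        raise ValueError("contours must have the same number of frames")
    if len(f0_ref) == 0:
        raise ValueError("contours must not be empty")

    errors = 0
    for ref, est in zip(f0_ref, f0_est):
        ref_voiced = ref > 0
        est_voiced = est > 0
        if ref_voiced != est_voiced:
            # Voicing decision error: only one contour marks the frame voiced.
            errors += 1
        elif ref_voiced and abs(est - ref) / ref > tolerance:
            # Gross pitch error: both voiced, relative deviation too large.
            errors += 1
    return errors / len(f0_ref)


# Hypothetical contours: frame 3 has a pitch error, frame 4 a voicing error.
reference = [0.0, 100.0, 200.0, 200.0]
estimate = [0.0, 110.0, 300.0, 0.0]
print(f0_frame_error(reference, estimate))
```

Lower values mean the synthesized pitch contour follows the reference more closely. In practice both contours would come from the same pitch tracker run at the same frame rate, so that frames line up one to one.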

Tags: SPS conference, ICASSP 2020 virtual conference, May 2020, ICASSP 2020
