SPEAKER GENERATION

Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, David Kao

SPS

Length: 00:11:05

12 May 2022

This work explores the task of synthesizing speech in non-existent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on the web.
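
The idea of sampling novel speakers from a learned distribution over an embedding space can be illustrated with a minimal sketch. The abstract does not specify the form of the distribution, so this example assumes a diagonal-covariance Gaussian mixture purely for illustration. The function name, parameter values, and embedding size are all hypothetical and are not taken from TacoSpawn itself.

```python
import random

def sample_speaker_embedding(weights, means, stddevs, rng):
    """Draw one embedding from a diagonal-covariance Gaussian mixture.

    Assumption: the learned speaker distribution is a Gaussian mixture.
    The abstract only says a distribution over embeddings is learned.
    """
    # Pick a mixture component in proportion to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample each embedding dimension independently around that component's mean.
    return [rng.gauss(m, s) for m, s in zip(means[k], stddevs[k])]

# Hypothetical "learned" parameters: 2 components in a 4-dimensional space.
WEIGHTS = [0.6, 0.4]
MEANS = [[0.0, 1.0, -1.0, 0.5], [2.0, -0.5, 0.0, 1.5]]
STDDEVS = [[0.1, 0.2, 0.1, 0.3], [0.2, 0.1, 0.3, 0.1]]

rng = random.Random(0)  # fixed seed so repeated runs draw the same speakers
novel_speakers = [
    sample_speaker_embedding(WEIGHTS, MEANS, STDDEVS, rng) for _ in range(3)
]
for emb in novel_speakers:
    print([round(x, 3) for x in emb])
```

In a full system, each sampled vector would condition the text-to-speech model in place of an embedding looked up for a training speaker. That conditioning step is outside the scope of this sketch.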

Tags:

speaker embeddings

end-to-end tts

unseen speakers

speech synthesis

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
