nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech
Botao Zhao, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
Multi-speaker text-to-speech (TTS) with only a small amount of adaptation data is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize the voice of a new speaker without fine-tuning, using only one adaptation utterance. Instead of relying on a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of Z is approximated by another variable conditioned on the reference mel-spectrogram and the phoneme sequence. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural speech similar to the target speaker with only one adaptation utterance.
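To make the conditioning described above concrete, the sketch below shows a generic conditional VAE whose approximate posterior is conditioned on a pooled reference mel-spectrogram and a phoneme (content) representation, and whose latent Z is fed to the decoder together with the content. This is a minimal illustrative example in PyTorch with hypothetical module names and dimensions, not the authors' released implementation of nnSpeech.

```python
# Minimal sketch of a speaker-guided conditional VAE (illustrative only;
# module names, dimensions, and pooling choice are assumptions, not nnSpeech).
import torch
import torch.nn as nn


class SpeakerGuidedCVAE(nn.Module):
    def __init__(self, phoneme_dim=256, mel_dim=80, latent_dim=64, hidden=256):
        super().__init__()
        # Approximate posterior q(Z | mel_ref, phoneme): conditioned on the
        # reference mel-spectrogram and the phoneme (content) encoding.
        self.enc = nn.Sequential(
            nn.Linear(mel_dim + phoneme_dim, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        # Decoder maps [phoneme, Z] to a mel frame; Z carries speaker
        # characteristics mixed with content information.
        self.dec = nn.Sequential(
            nn.Linear(phoneme_dim + latent_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, mel_dim))

    def forward(self, phoneme, mel_ref):
        # phoneme: (batch, phoneme_dim); mel_ref: (batch, frames, mel_dim)
        ref = mel_ref.mean(dim=1)  # pool the reference utterance over frames
        h = self.enc(torch.cat([ref, phoneme], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample Z from the approximate posterior.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        mel_hat = self.dec(torch.cat([phoneme, z], dim=-1))
        # KL term regularizes the posterior toward a standard normal prior.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mel_hat, kl


# Usage sketch: one reference utterance (50 frames) adapts the latent Z.
model = SpeakerGuidedCVAE()
mel_hat, kl = model(torch.randn(2, 256), torch.randn(2, 50, 80))
```

At inference time, the same idea allows zero-shot adaptation: a single reference utterance from an unseen speaker is enough to infer Z, so no fine-tuning of the model weights is required.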