ANY-TO-ANY VOICE CONVERSION WITH F0 AND TIMBRE DISENTANGLEMENT AND NOVEL TIMBRE CONDITIONING

Sudheer Kumar Kovela (NVIDIA); Rafael Valle (NVIDIA); Ambrish Dantrey (NVIDIA); Bryan Catanzaro (NVIDIA)

SPS

08 Jun 2023

Despite recent advances in voice conversion (VC), real-time one-shot voice conversion with good control over timbre and $F_0$ remains challenging. In this work, we present a VC model based on phonetic posteriorgrams (PPGs) that directly decodes waveforms. We design a speaker-conditioned decoder based on HiFi-GAN (Kong et al., 2020) and pair it with a new discriminator to produce high-quality audio. Using an $F_0$ prenet and an $F_0$-augmented speaker encoder, we are able to control $F_0$ and timbre independently with high fidelity. Our objective and subjective evaluations show that our method is preferred over others in terms of audio quality, timbre similarity, and prosody retention.
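
As a rough illustration of what independent $F_0$ control can mean at the input side, the sketch below transposes an $F_0$ contour by a number of semitones and leaves the speaker (timbre) conditioning untouched. It is not the authors' implementation: the function name `shift_f0`, the convention that a value of 0 marks an unvoiced frame, and the list-of-floats representation are all assumptions made for this example.

```python
def shift_f0(f0_hz, semitones):
    """Transpose an F0 contour (Hz per frame) by `semitones`.

    Assumption for this sketch: frames with F0 <= 0 are unvoiced
    and are returned as 0.0 rather than being scaled.
    """
    # One semitone corresponds to a frequency ratio of 2 ** (1 / 12).
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


# Hypothetical contour: unvoiced, 110 Hz, 220 Hz, unvoiced.
contour = [0.0, 110.0, 220.0, 0.0]
octave_up = shift_f0(contour, 12)
```

In a model of the kind described in the abstract, a contour modified this way would presumably be passed to the $F_0$ prenet while the speaker embedding stays fixed, so that pitch changes and the target timbre does not.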

Tags:

Speech production, perception and psychoacoustics