JSV-VC: JOINTLY TRAINED SPEAKER VERIFICATION AND VOICE CONVERSION MODELS
Shogo Seki (NTT Corporation); Hirokazu Kameoka (NTT Communication Science Laboratories, NTT Corporation); Kou Tanaka (NTT Corporation); Takuhiro Kaneko (NTT Corporation)
IEEE Signal Processing Society (SPS)
This paper proposes a variational autoencoder (VAE)-based method for voice conversion (VC) between arbitrary source-target speaker pairs without parallel corpora, i.e., non-parallel any-to-any VC. One typical approach is to use speaker embeddings obtained from a speaker verification (SV) model as the condition for a VC model. However, when the VC and SV models are naively combined, the converted speech is not guaranteed to reflect the target speaker's characteristics. Moreover, the speaker embeddings are not designed for VC problems, leading to suboptimal conversion performance. To address these issues, the proposed method, JSV-VC, trains the VC and SV models jointly: the VC model is trained so that converted speech is verified as the target speaker by the SV model, while the SV model is trained to output consistent embeddings before and after conversion. Experimental evaluations show that JSV-VC outperforms conventional any-to-any VC methods both quantitatively and qualitatively.
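The two training objectives described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation; the linear "models", dimensions, and loss forms below are all hypothetical stand-ins chosen only to make the joint-training idea concrete: a VC-side loss pulling the converted speech's embedding toward the target embedding, and an SV-side consistency term between embeddings before and after conversion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two models: an SV "encoder" mapping
# utterance features to a speaker embedding, and a VC "decoder" mapping
# (source features, target embedding) to converted features.
D_FEAT, D_EMB = 16, 8
W_sv = rng.standard_normal((D_EMB, D_FEAT)) * 0.1           # SV encoder weights
W_vc = rng.standard_normal((D_FEAT, D_FEAT + D_EMB)) * 0.1  # VC decoder weights

def sv_embed(x):
    """L2-normalised speaker embedding (toy SV model)."""
    e = W_sv @ x
    return e / (np.linalg.norm(e) + 1e-8)

def vc_convert(x_src, e_tgt):
    """Converted features conditioned on a target embedding (toy VC model)."""
    return W_vc @ np.concatenate([x_src, e_tgt])

x_src = rng.standard_normal(D_FEAT)   # source-speaker utterance features
x_tgt = rng.standard_normal(D_FEAT)   # target-speaker utterance features

e_tgt = sv_embed(x_tgt)               # embedding conditioning the VC model
x_conv = vc_convert(x_src, e_tgt)     # converted speech features
e_conv = sv_embed(x_conv)             # SV embedding of the converted speech

# VC-side objective: converted speech should be verified as the target
# speaker, i.e. maximise the cosine similarity between e_conv and e_tgt.
loss_vc = 1.0 - float(e_conv @ e_tgt)

# SV-side objective: embeddings should stay consistent before and after
# conversion (here, a squared distance between the two embeddings).
loss_sv = float(np.sum((e_conv - e_tgt) ** 2))

joint_loss = loss_vc + loss_sv        # minimised jointly over both models
```

In actual joint training both loss terms would be backpropagated through the respective networks; the point of the sketch is only how the SV model appears inside the VC objective and vice versa.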