A Comparative Study of Self-Supervised Speech Representation Based Voice Conversion

Wen-Chin Huang (Nagoya University); Shu-wen Yang (National Taiwan University); Tomoki Hayashi (Nagoya University); Tomoki Toda (Nagoya University)

09 Jun 2023

We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive because they could replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC toolkit we previously developed, we provide a series of in-depth objective and subjective analyses under three VC settings: intra-lingual any-to-one (A2O), cross-lingual A2O, and any-to-any (A2A) VC, all using the Voice Conversion Challenge 2020 (VCC2020) dataset. We investigate S3R-based VC from several angles, including model type, multilinguality, and supervision. We also study the effect of a post-discretization process with k-means clustering and show how it improves performance in the A2A setting. Finally, a comparison with state-of-the-art VC systems demonstrates the competitiveness of S3R-based VC and sheds light on possible directions for improvement.
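The abstract does not spell out how the post-discretization step works. As a rough illustration only, the sketch below assumes it means fitting k-means on frame-level features and then replacing every frame with its nearest centroid. Random vectors stand in for real S3R features, and the function names, feature dimension (16), and cluster count (8) are all hypothetical choices, not values taken from the paper or from S3PRL-VC.

```python
import numpy as np


def assign(frames, centroids):
    """Return the index of the nearest centroid for every frame."""
    # Squared Euclidean distance between each frame and each centroid: (T, K).
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def fit_kmeans(frames, n_clusters, n_iters=20, seed=0):
    """Plain Lloyd's k-means over frame-level features of shape (T, D)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen distinct frames.
    start = rng.choice(len(frames), size=n_clusters, replace=False)
    centroids = frames[start].copy()
    for _ in range(n_iters):
        ids = assign(frames, centroids)
        for k in range(n_clusters):
            members = frames[ids == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


def discretize(frames, centroids):
    """Map each frame to a cluster ID and to the matching centroid vector."""
    ids = assign(frames, centroids)
    return ids, centroids[ids]


# Stand-in data (assumption): random vectors instead of real S3R features.
rng = np.random.default_rng(0)
train_frames = rng.normal(size=(500, 16))  # frames pooled from training speech
utterance = rng.normal(size=(40, 16))      # frames of one utterance to convert

centroids = fit_kmeans(train_frames, n_clusters=8)
unit_ids, quantized = discretize(utterance, centroids)
```

In this sketch, `quantized` keeps the shape of the continuous features but can take only 8 distinct values per frame, so a downstream synthesizer would see a coarser representation. Whether the paper feeds centroid vectors, cluster IDs, or something else to the synthesizer is not stated in the abstract.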

Tags:

Signal Processing for Communications and Networking

IEEE ICASSP 2023, 4-10 June 2023, Greece.
