13 May 2022

Recently, few-shot voice cloning has achieved significant improvements. However, most models for few-shot voice cloning are single-modal, and multi-modal few-shot voice cloning remains understudied. In this paper, we propose to use multi-modal learning to improve few-shot voice cloning performance. The proposed multi-modal system is built by extending Tacotron2 with an unsupervised speech representation module. We evaluate the proposed system in two few-shot voice cloning scenarios, namely few-shot text-to-speech (TTS) and voice conversion (VC). Experimental results demonstrate that the proposed multi-modal learning significantly improves few-shot voice cloning performance over the counterpart single-modal systems.
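
The abstract does not detail the architecture, but a minimal PyTorch sketch of the stated idea follows: a Tacotron2-style acoustic model extended with an unsupervised speech-representation branch, so a shared decoder can be conditioned on either text (the TTS path) or speech units (the VC path). All module names, dimensions, the speaker-conditioning scheme, and the fusion strategy here are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Stand-in for the Tacotron2 text encoder (character embeddings + BiLSTM)."""

    def __init__(self, vocab_size: int = 100, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(self.embed(tokens))
        return out  # (batch, text_len, dim)


class SpeechRepresentationEncoder(nn.Module):
    """Stand-in for the unsupervised speech representation module: maps mel
    frames into the same hidden space as the text encoder, so the decoder can
    consume either modality (a hypothetical design choice)."""

    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        return self.net(mels)  # (batch, frame_len, dim)


class MultiModalAcousticModel(nn.Module):
    """Shared decoder conditioned on text (few-shot TTS) or speech (VC),
    plus a speaker embedding for cloning the target voice."""

    def __init__(self, dim: int = 256, n_mels: int = 80, n_speakers: int = 16):
        super().__init__()
        self.text_encoder = TextEncoder(dim=dim)
        self.speech_encoder = SpeechRepresentationEncoder(n_mels=n_mels, dim=dim)
        self.speaker_embed = nn.Embedding(n_speakers, dim)
        # Plain GRU as a placeholder for the attention-based Tacotron2 decoder.
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, inputs, speaker_id, modality="text"):
        if modality == "text":
            hidden = self.text_encoder(inputs)    # TTS path: text -> speech
        else:
            hidden = self.speech_encoder(inputs)  # VC path: speech -> speech
        # Add the target-speaker embedding to every encoder frame.
        hidden = hidden + self.speaker_embed(speaker_id).unsqueeze(1)
        out, _ = self.decoder(hidden)
        return self.to_mel(out)  # predicted mel-spectrogram


if __name__ == "__main__":
    model = MultiModalAcousticModel()
    tokens = torch.randint(0, 100, (2, 20))  # TTS input: character ids
    mels = torch.randn(2, 50, 80)            # VC input: source-speaker mels
    spk = torch.tensor([3, 7])               # target-speaker ids
    print(model(tokens, spk, "text").shape)    # torch.Size([2, 20, 80])
    print(model(mels, spk, "speech").shape)    # torch.Size([2, 50, 80])
```

The point of such a shared decoder is that both tasks train the same speaker-conditioned generator, which is presumably how multi-modal learning helps when only a few samples of the target voice are available.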