IMPROVE FEW-SHOT VOICE CLONING USING MULTI-MODAL LEARNING
Haitong Zhang, Yue Lin
Recently, few-shot voice cloning has achieved significant improvements. However, most models for few-shot voice cloning are single-modal, and multi-modal few-shot voice cloning has been understudied. In this paper, we propose to use multi-modal learning to improve few-shot voice cloning performance. The proposed multi-modal system is built by extending Tacotron2 with an unsupervised speech representation module. We evaluate the proposed system in two few-shot voice cloning scenarios, namely few-shot text-to-speech (TTS) and voice conversion (VC). Experimental results demonstrate that the proposed multi-modal learning significantly improves few-shot voice cloning performance over the corresponding single-modal systems.
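The abstract only names the components, so the following minimal PyTorch sketch illustrates one way a Tacotron2-style text encoder and an unsupervised speech representation encoder could share a decoder, letting either text (few-shot TTS) or source speech (VC) drive synthesis. All class names, dimensions, the fusion strategy, and the simplified non-attentive decoder are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only (not the paper's code): two modality-specific
# encoders map into a shared latent space consumed by one decoder.

class TextEncoder(nn.Module):
    """Stand-in for a Tacotron2-style text encoder (embedding + BiLSTM)."""
    def __init__(self, vocab_size=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, tokens):                    # tokens: (B, T_text)
        out, _ = self.rnn(self.embed(tokens))     # (B, T_text, dim)
        return out

class SpeechRepEncoder(nn.Module):
    """Stand-in for an unsupervised speech representation module that maps
    acoustic frames into the same latent space as the text encoder."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, mels):                      # mels: (B, T_frames, n_mels)
        return self.proj(mels)                    # (B, T_frames, dim)

class MultiModalAcousticModel(nn.Module):
    """Shares one decoder across both modalities; a real system would use
    Tacotron2's attention-based autoregressive decoder instead of this LSTM."""
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.text_encoder = TextEncoder(dim=dim)
        self.speech_encoder = SpeechRepEncoder(n_mels=n_mels, dim=dim)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.mel_head = nn.Linear(dim, n_mels)

    def forward(self, tokens=None, mels=None):
        # Condition on whichever modality is available for this batch.
        latents = (self.text_encoder(tokens) if tokens is not None
                   else self.speech_encoder(mels))
        out, _ = self.decoder(latents)
        return self.mel_head(out)                 # predicted mel frames

model = MultiModalAcousticModel()
tts_out = model(tokens=torch.randint(0, 100, (2, 30)))  # text -> mel (TTS path)
vc_out = model(mels=torch.randn(2, 120, 80))            # speech -> mel (VC path)
```

Training both paths against a shared decoder is what makes the system multi-modal: the speech branch provides a learning signal even for the text branch, which is the plausible source of the few-shot gains the abstract reports.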