Code-Switching Speech Synthesis Based on Self-Supervised Learning and Domain Adaptive Speaker Encoder

YiXing Lin (National Central University); Cheng-Hsun Pai (National Central University); Le Phuong (National Central University); Bima Prihasto (National Central University); CHIEN-LIN HUANG (NCKU); Jia-Ching Wang (National Central University)


08 Jun 2023

End-to-end speech synthesis models based on deep learning have recently made great progress in speech quality and have gradually replaced traditional speech synthesis methods as the mainstream approach. However, synthesizing highly natural speech with these models remains challenging. To address this, we introduce self-supervised learning and frame-level domain adversarial training into a speaker encoder trained on the speaker verification task, so that speaker vectors from different languages keep a consistent distribution in the speaker space, which improves speech synthesis performance. In addition, we adopt a non-autoregressive speech synthesis model to avoid the unnatural speech rate that arises in cross-language speech synthesis. We first show that, on a mixed-language dataset combining LibriTTS and AISHELL3, the speaker encoder trained with self-supervised representations achieves a 4.968% absolute EER reduction over traditional MFCC features on the speaker verification task, indicating that self-supervised representations generalize better to datasets with complex domains. We then obtain MOS scores of 3.635 for speech naturalness and 3.675 for speaker similarity on the code-switching speech synthesis task. Our approach removes the need, common in prior literature, for multiple monolingual encoders to model linguistic information, and adds frame-level domain adversarial training to optimize the speaker vectors in the speaker feature space, which facilitates code-switching speech synthesis.
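The speaker verification result in the abstract is reported as an equal error rate (EER), the operating point at which the false acceptance rate and the false rejection rate coincide; an "absolute" reduction is a difference in percentage points. The sketch below is not the authors' evaluation code. It is a minimal illustration of how an EER can be estimated from trial scores, assuming that higher scores mean "same speaker" and that a trial is accepted when its score is at or above the threshold. The function name `equal_error_rate` and the toy scores are hypothetical.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER from verification trial scores (illustrative sketch).

    genuine_scores:  similarity scores of same-speaker trials
    impostor_scores: similarity scores of different-speaker trials
    Assumption: a trial is accepted when score >= threshold.
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    # Every observed score is a candidate threshold.
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False acceptance rate: impostor trials that are accepted.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials that are rejected.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of FAR and FRR at the closest crossing.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not from the paper).
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite trial list the two error rates rarely cross exactly, so the sketch reports the midpoint at the closest threshold; production toolkits typically interpolate along the ROC curve instead.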

Tags:

Multimedia analysis and synthesis

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
