Code-Switching Speech Synthesis Based on Self-Supervised Learning and Domain Adaptive Speaker Encoder
YiXing Lin (National Central University); Cheng-Hsun Pai (National Central University); Le Phuong (National Central University); Bima Prihasto (National Central University); Chien-Lin Huang (NCKU); Jia-Ching Wang (National Central University)
Recently, end-to-end speech synthesis models based on deep learning have made great progress in speech quality and have gradually replaced traditional speech synthesis methods as the mainstream approach. However, it remains challenging for these methods to synthesize highly natural speech. To address this problem, we introduce self-supervised learning and frame-level domain adversarial training into a speaker encoder trained on the speaker verification task, so that speaker vectors of different languages maintain a consistent distribution in the speaker space, improving speech synthesis performance. In addition, we adopt a non-autoregressive speech synthesis model to alleviate the unnatural speech rate caused by cross-lingual speech synthesis. We first demonstrate that, on a mixed-language dataset combining LibriTTS and AISHELL3, the speaker encoder trained with self-supervised representations achieves a 4.968% absolute EER reduction over traditional MFCC features on the speaker verification task, indicating that self-supervised representations generalize better on domain-complex datasets. We then obtain MOS scores of 3.635 and 3.675 for speech naturalness and speaker similarity, respectively, on the code-switching speech synthesis task. Our approach removes the need for the multiple monolingual encoders used in past literature to model linguistic information, and adds frame-level domain adversarial training to optimize the speaker vectors in the speaker feature space, facilitating the code-switching speech synthesis task.
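
As a rough illustration of the frame-level domain adversarial training described above, the following PyTorch-style sketch attaches a gradient reversal layer and a per-frame language (domain) classifier to the speaker encoder's frame-level features. The module names, dimensions, and the training-step interface (speaker_encoder, speaker_criterion, batch keys) are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class GradientReversal(torch.autograd.Function):
        # Identity in the forward pass; reverses and scales gradients in the backward pass.
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.clone()

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    class FrameLevelDomainClassifier(nn.Module):
        # Predicts the language/domain of every frame from the encoder's frame-level features.
        def __init__(self, feat_dim=256, num_domains=2, lambd=1.0):
            super().__init__()
            self.lambd = lambd
            self.classifier = nn.Sequential(
                nn.Linear(feat_dim, 128),
                nn.ReLU(),
                nn.Linear(128, num_domains),
            )

        def forward(self, frame_feats):
            # frame_feats: (batch, time, feat_dim) frame-level outputs of the speaker encoder
            reversed_feats = GradientReversal.apply(frame_feats, self.lambd)
            return self.classifier(reversed_feats)  # (batch, time, num_domains)

    # Hypothetical training step: the encoder is assumed to return both frame-level
    # features and an utterance-level speaker embedding from self-supervised input features.
    def training_step(speaker_encoder, domain_head, batch, speaker_criterion):
        frame_feats, speaker_emb = speaker_encoder(batch["ssl_features"])
        spk_loss = speaker_criterion(speaker_emb, batch["speaker_id"])
        domain_logits = domain_head(frame_feats)
        dom_loss = nn.functional.cross_entropy(
            domain_logits.transpose(1, 2),        # (batch, num_domains, time)
            batch["language_id_per_frame"],       # (batch, time) integer language labels
        )
        # The reversed gradient pushes the encoder toward language-invariant frame features
        # while the speaker loss keeps the embedding discriminative for speaker identity.
        return spk_loss + dom_loss

In this sketch the domain classifier is trained to identify the language of each frame, while the gradient reversal layer makes the speaker encoder adversarially remove that language information, which is one common way to realize the consistent cross-lingual speaker-vector distribution described in the abstract.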