Zero-shot Cross-lingual Transfer using multi-stream encoder and efficient speaker representation
Yibin Zheng, Zewang Zhang, Xinhui Li, Wenchao Su, Li Lu
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 00:12:39
We propose a novel method for the zero-shot cross-lingual TTS task that uses a multi-stream text encoder and an efficient speaker representation. Specifically, we propose a unified multi-stream text encoder that combines the advantages of Transformer and CBHG in order to retain multiple hypotheses about the input representations. For the Transformer-based stream, a multi-stream Transformer is further proposed to strengthen these hypotheses. The speaker representations are then extracted from audio signals by a speaker encoder with a random sampling mechanism and a language adversarial loss, with the aim of producing speaker embeddings that are independent of both content information and language identity. In addition, we propose an efficient zero-shot cross-lingual transfer strategy that draws on data from other speakers of the target language together with a language-balanced sampling strategy. Experimental results show that the proposed method not only achieves higher speech quality and speaker similarity for zero-shot cross-lingual transfer (average absolute MOS improvements of 0.38 and 0.27, respectively), but is also helpful for few-shot cross-lingual transfer in which multilingual data is available.
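The abstract mentions a language-balanced sampling strategy but does not define it. One common reading, offered here only as a minimal sketch and not as the authors' implementation, is to weight each utterance inversely to the size of its language group, so that every language contributes equally in expectation when training batches are drawn. The function names, the toy corpus, and the language codes below are hypothetical.

```python
import random
from collections import Counter


def language_balanced_weights(languages):
    """Return one sampling weight per utterance.

    Assumption (not from the paper): each language receives the same total
    probability mass, split evenly among that language's utterances.
    """
    counts = Counter(languages)
    n_langs = len(counts)
    return [1.0 / (n_langs * counts[lang]) for lang in languages]


def sample_batch(utterances, languages, batch_size, rng=None):
    """Draw a batch with replacement using the language-balanced weights."""
    if rng is None:
        rng = random.Random()
    weights = language_balanced_weights(languages)
    return rng.choices(utterances, weights=weights, k=batch_size)


# Hypothetical toy corpus: six Mandarin utterances and two English ones.
languages = ["zh"] * 6 + ["en"] * 2
utterances = [f"utt_{i}" for i in range(len(languages))]

weights = language_balanced_weights(languages)
batch = sample_batch(utterances, languages, batch_size=4, rng=random.Random(0))
```

With these weights the under-represented language is drawn as often as the dominant one on average, at the cost of repeating its utterances more frequently within an epoch.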