Zero-shot Cross-lingual Transfer using multi-stream encoder and efficient speaker representation

Yibin Zheng, Zewang Zhang, Xinhui Li, Wenchao Su, Li Lu

Length: 00:12:39

12 May 2022

We propose a novel method for the zero-shot cross-lingual TTS task that uses a multi-stream text encoder and an efficient speaker representation. Specifically, a unified multi-stream text encoder that combines the advantages of Transformer and CBHG is proposed to retain multiple hypotheses about the input representations. For the Transformer-based stream, a multi-stream Transformer is further proposed to strengthen these hypotheses. The speaker representations are then extracted from audio signals by a speaker encoder with a random sampling mechanism and a language adversarial loss, aiming to obtain speaker embeddings that are independent of both content information and language identity. We also propose an efficient zero-shot cross-lingual transfer strategy that relies on data from other speakers of the target language and a language-balanced sampling strategy. Experimental results show that the proposed method not only achieves higher speech quality and speaker similarity (average absolute MOS improvements of 0.38 and 0.27, respectively) for zero-shot cross-lingual transfer, but is also helpful for few-shot cross-lingual transfer, in which multi-lingual data is available.
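
The abstract does not say how the language-balanced sampling strategy is implemented. The sketch below shows one common reading, offered purely as an assumption: each utterance is weighted inversely to the size of its language, so every language has the same total probability of being drawn. The function names `language_balanced_weights` and `sample_batch` and the toy language labels are hypothetical and do not come from the paper.

```python
import random
from collections import Counter


def language_balanced_weights(languages):
    """Return one sampling weight per utterance.

    Assumed balancing rule (not taken from the paper): every language
    receives the same total probability mass, split evenly among its
    utterances. The weights sum to 1.
    """
    counts = Counter(languages)
    n_langs = len(counts)
    return [1.0 / (n_langs * counts[lang]) for lang in languages]


def sample_batch(utterances, languages, batch_size, rng=None):
    """Draw a batch (with replacement) using language-balanced weights."""
    rng = rng or random.Random()
    weights = language_balanced_weights(languages)
    return rng.choices(utterances, weights=weights, k=batch_size)


if __name__ == "__main__":
    # Toy corpus: 8 utterances in one language, 2 in another.
    langs = ["zh"] * 8 + ["en"] * 2
    utts = [f"utt_{i}" for i in range(len(langs))]
    print(language_balanced_weights(langs))
    print(sample_batch(utts, langs, batch_size=4, rng=random.Random(0)))
```

With this rule, an utterance from the smaller language is drawn more often than one from the larger language, so a batch contains roughly equal amounts of each language in expectation. Other balancing schemes, such as temperature-scaled sampling, would fit the same description.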

Tags:

multi-stream

speaker representations

zero-shot

cross-lingual transfer

end-to-end neural tts

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
