JOINT AND ADVERSARIAL TRAINING WITH ASR FOR EXPRESSIVE SPEECH SYNTHESIS
Kaili Zhang, Cheng Gong, Wenhuan Lu, Longbiao Wang, Jianguo Wei, Dawei Liu
IEEE SPS
Style modeling is an important problem in expressive speech synthesis. In existing unsupervised methods, a style encoder extracts a latent representation from the reference audio as the style information. However, the extracted style representation is entangled with content information, which conflicts with the actual input text and degrades the synthesized speech. In this study, we propose to alleviate this entanglement by integrating a Text-To-Speech (TTS) model and an Automatic Speech Recognition (ASR) model through a shared-layer network for joint training, and by applying ASR adversarial training to remove content information from the style representation. We further propose an adaptive adversarial weight learning strategy to prevent model collapse. Objective evaluation using word error rate (WER) demonstrates that our method effectively alleviates the entanglement between style and content information, and subjective evaluation shows that it improves the quality of the synthesized speech and enhances style transfer ability compared with the baseline models.
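The adversarial component described above is commonly realized with a gradient reversal layer: the style embedding is passed forward unchanged to an ASR content predictor, while the backward gradient is negated and scaled, so minimizing the ASR loss pushes content information *out* of the style embedding. The abstract does not specify the exact form of the adaptive adversarial weight, so the sketch below is illustrative only: `GradReverse` and the DANN-style sigmoid ramp in `adversarial_weight` are common stand-ins, not the authors' implementation, and all names are hypothetical.

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer (illustrative sketch).

    Forward pass: identity on the style embedding.
    Backward pass: gradient is negated and scaled by lam, so the
    style encoder is trained to *increase* the ASR content loss,
    discouraging content leakage into the style representation.
    """

    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # identity in the forward direction

    def backward(self, grad_out: np.ndarray) -> np.ndarray:
        return -self.lam * grad_out  # reversed, scaled gradient


def adversarial_weight(step: int, total_steps: int, gamma: float = 10.0) -> float:
    """One plausible adaptive schedule (assumption, not the paper's method):
    ramp the adversarial weight from 0 toward 1 as training progresses,
    so the adversary does not overwhelm the TTS loss early and collapse
    the model. This is the sigmoid ramp popularized by DANN.
    """
    p = step / total_steps
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0


# Usage: route the style embedding through the reversal layer before
# the ASR content predictor, with a step-dependent weight.
grl = GradReverse(lam=adversarial_weight(step=500, total_steps=1000))
style_embedding = np.array([0.2, -1.3, 0.7])
to_asr_branch = grl.forward(style_embedding)          # unchanged forward
grad_to_encoder = grl.backward(np.ones_like(style_embedding))  # sign-flipped
```

In this setup the ASR branch still learns to read content normally, while the style encoder receives only the reversed gradient, which is what drives the disentanglement.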