Emotion Controllable Speech Synthesis Using Emotion-Unlabeled Dataset With The Assistance Of Cross-Domain Speech Emotion Recognition

Xiong Cai, Dongyang Dai, Zhiyong Wu, Xiang Li, Jingbei Li, Helen Meng

Length: 00:11:57

08 Jun 2021

Neural text-to-speech (TTS) approaches generally require a large amount of high-quality speech data, and it is difficult to obtain such a dataset with additional emotion labels. In this paper, we propose a novel approach for emotional TTS synthesis on a TTS dataset without emotion labels. Specifically, our proposed method consists of a cross-domain speech emotion recognition (SER) model and an emotional TTS model. First, we train the cross-domain SER model on both SER and TTS datasets. Then, we use the soft labels that the trained SER model predicts on the TTS dataset to build an auxiliary SER task that is jointly trained with the TTS model. Experimental results show that our proposed method can generate speech with the specified emotional expressiveness and with almost no degradation in speech quality.
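
The joint training step described above can be illustrated with a minimal sketch. It assumes the auxiliary SER task is scored with a cross-entropy against the SER model's soft labels and is added to the TTS loss with a fixed weight; the abstract does not state the exact loss or weighting, so the function names, the `ser_weight` parameter, and the four-class example are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def soft_label_cross_entropy(logits, soft_labels):
    """Cross-entropy between a soft label distribution and predicted logits.

    soft_labels is assumed to be a probability distribution over emotion
    classes, e.g. the posterior predicted by a pre-trained SER model.
    """
    log_probs = log_softmax(logits)
    return -sum(q * lp for q, lp in zip(soft_labels, log_probs))


def joint_loss(tts_loss, ser_logits, soft_labels, ser_weight=1.0):
    """Assumed combination: TTS loss plus a weighted auxiliary SER loss."""
    return tts_loss + ser_weight * soft_label_cross_entropy(ser_logits, soft_labels)


# Hypothetical example with four emotion classes.
soft = [0.7, 0.1, 0.1, 0.1]      # soft label from the SER model
logits = [2.0, 0.5, 0.1, -1.0]   # auxiliary classifier output in the TTS model
total = joint_loss(tts_loss=0.8, ser_logits=logits, soft_labels=soft, ser_weight=0.5)
```

Using the full soft distribution instead of its arg-max keeps the SER model's uncertainty in the training signal, which is one plausible reason to prefer soft labels when the TTS data has no ground-truth emotion annotations.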

Chairs:

Yu Zhang

Tags:

signal processing society

IEEE icassp 2021

virtual conference

2021

sps

virtual conference icassp 2021

june 6-11 2021

icassp 2021