MSDTRON: A HIGH-CAPABILITY MULTI-SPEAKER SPEECH SYNTHESIS SYSTEM FOR DIVERSE DATA USING CHARACTERISTIC INFORMATION

Qinghua Wu, Quanbo Shen, Jian Luan, Yujun Wang

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:11:33

08 May 2022

In multi-speaker speech synthesis, data from a number of speakers usually tend to have great diversity due to the fact that the speakers may differ largely in ages, speaking styles, emotions, and so on. It is important but challenging to improve the modeling capabilities for multi-speaker speech synthesis. To address the issue, this paper proposes a high-capability speech synthesis system, called Msdtron, in which 1) a representation of the harmonic structure of speech, called excitation spectrogram, is designed to directly guide the learning of harmonics in mel-spectrogram. 2) conditional gated LSTM (CGLSTM) is proposed to control the flow of text content information through the network by re-weighting the gates of LSTM using speaker information. The experiments show a significant reduction in reconstruction error of mel-spectrogram in the training of the multi-speaker model, and a great improvement is observed in the subjective evaluation of speaker adapted model.

Tags:

voice cloning

speaker adaptation

multi-speaker speech synthesis

few-shot speech synthesis

MSDTRON: A HIGH-CAPABILITY MULTI-SPEAKER SPEECH SYNTHESIS SYSTEM FOR DIVERSE DATA USING CHARACTERISTIC INFORMATION

Qinghua Wu, Quanbo Shen, Jian Luan, Yujun Wang

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

VOICE FILTER: FEW-SHOT TEXT-TO-SPEECH SPEAKER ADAPTATION USING VOICE CONVERSION AS A POST-PROCESSING MODULE

PVAE-TTS: ADAPTIVE TEXT-TO-SPEECH VIA PROGRESSIVE STYLE ADAPTATION

Join the IEEE Signal Processing Society