PROSODYSPEECH: TOWARDS ADVANCED PROSODY MODEL FOR NEURAL TEXT-TO-SPEECH

Yuanhao Yi, Lei He, Shifeng Pan, Xi Wang, Yujia Xiao

11 May 2022

This paper proposes ProsodySpeech, a novel prosody model that enhances encoder-decoder neural Text-To-Speech (TTS) to generate highly expressive and personalized speech even with very limited training data. First, a Prosody Extractor, built from a large speech corpus with various speakers, generates a set of prosody exemplars from multiple reference speeches; Mutual Information based Style content separation (MIST) is adopted to alleviate the "content leakage" problem. Second, a Prosody Distributor makes a soft, phone-level selection of appropriate prosody exemplars with the help of an attention mechanism. The resulting prosody feature is then aggregated into the output of the text encoder, together with an additional phone-level pitch feature to further enrich the prosody. We apply this method to two tasks: highly expressive multi-style/emotion TTS and few-shot personalized TTS. Experiments show that the proposed model outperforms the FastSpeech 2 + GST baseline, with significant improvements in terms of similarity and style expression.
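The attention-based soft selection and the aggregation step described in the abstract can be pictured with a minimal PyTorch sketch. This is not the authors' implementation: the model dimension, the choice of multi-head attention, and the scalar pitch projection are assumptions made purely for illustration.

import torch
import torch.nn as nn


class ProsodyDistributor(nn.Module):
    """Sketch of a Prosody Distributor: text-encoder outputs act as queries
    over a bank of prosody exemplars, and the attended prosody vector is
    added to the encoder output together with a phone-level pitch feature."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Multi-head attention performs the "soft selection" of exemplars.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical projection of a scalar phone-level pitch value.
        self.pitch_proj = nn.Linear(1, d_model)

    def forward(
        self,
        text_enc: torch.Tensor,   # (B, T_phones, d_model) text encoder output
        exemplars: torch.Tensor,  # (B, N_exemplars, d_model) from the Prosody Extractor
        pitch: torch.Tensor,      # (B, T_phones) phone-level pitch, e.g. log F0
    ) -> torch.Tensor:
        # Soft selection: each phone attends over all prosody exemplars.
        prosody, _ = self.attn(query=text_enc, key=exemplars, value=exemplars)
        # Aggregate prosody and pitch features into the encoder output.
        return text_enc + prosody + self.pitch_proj(pitch.unsqueeze(-1))


if __name__ == "__main__":
    dist = ProsodyDistributor()
    out = dist(torch.randn(2, 37, 256), torch.randn(2, 10, 256), torch.randn(2, 37))
    print(out.shape)  # torch.Size([2, 37, 256])

The enriched encoder output would then feed the decoder as in a standard encoder-decoder TTS pipeline; how the exemplar bank is produced by the Prosody Extractor and regularized with MIST is beyond this sketch.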
