PROMPTTTS: CONTROLLABLE TEXT-TO-SPEECH WITH TEXT DESCRIPTIONS

Zhifang Guo (University of Chinese Academy of Sciences); Yichong Leng (University of Science and Technology of China); Yihan Wu (Renmin University of China); Sheng Zhao (Microsoft); Xu Tan (Microsoft Research Asia)

06 Jun 2023

Using a text description as a prompt to guide the generation of text or images (e.g., GPT-3 or DALLE-2) has drawn wide attention recently. Beyond text and image generation, in this work we explore the possibility of using text descriptions to guide speech synthesis. We develop a text-to-speech (TTS) system, dubbed PromptTTS, that takes a prompt with both style and content descriptions as input and synthesizes the corresponding speech. Specifically, PromptTTS consists of a style encoder and a content encoder that extract the corresponding representations from the prompt, and a speech decoder that synthesizes speech according to the extracted style and content representations. Compared with previous work on controllable TTS, which requires users to have acoustic knowledge to understand style factors such as prosody and pitch, PromptTTS is more user-friendly because text descriptions are a more natural way to express speech style (e.g., "A lady whispers to her friend slowly"). Because no existing TTS dataset includes prompts, we construct and release a dataset containing prompts with style and content information and the corresponding speech, in order to benchmark the PromptTTS task. Experiments show that PromptTTS can generate speech with precise style control and high speech quality. Audio samples and our dataset are publicly available at https://prompttts.github.io/prompttts.
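
The data flow described in the abstract (a prompt split into style and content, two encoders, one decoder) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the internals of any component, so every name below (`Prompt`, `style_encoder`, `content_encoder`, `speech_decoder`, `synthesize`) is hypothetical, and each function is a toy stand-in that only shows how the pieces connect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prompt:
    """Hypothetical container for the two parts of a prompt."""
    style: str    # e.g. "A lady whispers to her friend slowly"
    content: str  # the text to be spoken


def style_encoder(description: str, dim: int = 4) -> List[float]:
    """Toy stand-in: bucket each word of the style description into a
    fixed-size count vector. A real system would use a learned encoder."""
    vec = [0.0] * dim
    for word in description.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


def content_encoder(text: str) -> List[int]:
    """Toy stand-in: one integer id per character of the content text."""
    return [ord(c) for c in text.lower()]


def speech_decoder(style_vec: List[float], content_ids: List[int]) -> List[List[float]]:
    """Toy stand-in: emit one 'frame' per content token, conditioned on
    the style vector by simple concatenation (an assumption, not the paper's method)."""
    return [[float(tok)] + style_vec for tok in content_ids]


def synthesize(prompt: Prompt) -> List[List[float]]:
    """Wire the three components together in the order the abstract describes."""
    style_vec = style_encoder(prompt.style)
    content_ids = content_encoder(prompt.content)
    return speech_decoder(style_vec, content_ids)


frames = synthesize(Prompt(style="A lady whispers to her friend slowly",
                           content="Hello"))
print(len(frames), "frames of size", len(frames[0]))
```

The only point of the sketch is the separation of concerns: style and content are encoded independently from the same prompt, and the decoder consumes both representations.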

Tags:

Multimodal processing of language

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
