M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

Jinlong Xue (Beijing University of Posts and Telecommunications); Yayue Deng (Beijing University of Posts and Telecommunications); Fengping Wang (Beijing University of Posts and Telecommunications); Ya Li (Beijing University of Posts and Telecommunications); Yingming Gao (Beijing University of Posts and Telecommunications); Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences); Jianqing Sun (Unisound AI Technology Co., Ltd.); Jiaen Liang (Unisound)

06 Jun 2023

Conversational text-to-speech (TTS) aims to synthesize a reply whose prosody suits the preceding conversation. Modeling the conversation comprehensively remains a challenge, however: most conversational TTS systems extract only global information and omit local prosody features, which carry important fine-grained information such as keywords and emphasis. Textual features alone are also insufficient, because acoustic features carry prosody information as well. We therefore propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system that aims to make comprehensive use of the historical conversation and enhance prosodic expression. Specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results show that the model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody performance and naturalness in CMOS tests.

Tags:

Multimodal processing of language

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
