FINE-GRAINED EMOTIONAL CONTROL OF TEXT-TO-SPEECH: LEARNING TO RANK INTER- AND INTRA-CLASS EMOTION INTENSITIES

Shijun Wang (University of St. Gallen); Jon Gudnason (Reykjavik University); Damian Borth (University of St. Gallen)

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

06 Jun 2023

State-of-the-art Text-To-Speech (TTS) models are capable of producing high-quality speech. The generated speech, however, is usually neutral in emotional expression, whereas one often wants fine-grained emotional control at the word or phoneme level. Although this remains challenging, the first TTS models that control the voice through manually assigned emotion intensity have recently been proposed. Unfortunately, because they neglect intra-class distance, the resulting intensity differences are often unrecognizable. In this paper, we propose a fine-grained controllable emotional TTS model that considers both inter- and intra-class distances and is able to synthesize speech with recognizable intensity differences. Our subjective and objective experiments demonstrate that our model outperforms two state-of-the-art controllable TTS models in controllability, emotion expressiveness, and naturalness.

Tags:

Speech emotion detection and analysis

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
