FRONTEND ATTRIBUTES DISENTANGLEMENT FOR SPEECH EMOTION RECOGNITION

Yu-Xuan Xi, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu

Length: 00:12:54

11 May 2022

Speech emotion recognition (SER) with a limited-size dataset is a challenging task, since a spoken utterance contains various interfering attributes besides emotion, including speaker, content, and language. However, because speaker and emotion attributes are closely related, simply fine-tuning a linear model on the utterance-level embeddings (i.e., i-vectors and x-vectors) extracted from pre-trained speaker recognition (SR) frontends is enough to obtain good SER performance. In this paper, we aim to perform frontend attributes disentanglement (AD) for the SER task, using a pre-trained SR model. Specifically, we propose an AD module consisting of attribute normalization (AN) and attribute reconstruction (AR) phases. AN filters out the variation information using instance normalization (IN), and AR reconstructs the emotion-relevant features from the residual space to ensure high emotion discrimination. For better disentanglement, a dual space loss is then designed to encourage the separability of the emotion-relevant and emotion-irrelevant spaces. To introduce long-range contextual information for emotion-related reconstruction, a time-frequency (TF) attention is further proposed. Unlike style disentanglement of the extracted x-vectors, the proposed AD module can be applied to the frontend feature extractor. Experiments on the IEMOCAP benchmark demonstrate the effectiveness of the proposed method.
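
The normalize-then-reconstruct idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature layout (channels, frequency, time), the function names, and in particular the fixed per-channel `gate` are assumptions made for illustration. In the paper the reconstruction is learned (with a TF attention and a dual space loss), neither of which is modeled here.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """Instance normalization: standardize each channel of a single
    utterance's feature map over its (frequency, time) axes.

    x has the assumed shape (channels, freq, time).
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def normalize_and_reconstruct(x, gate):
    """Hypothetical stand-in for the AN + AR phases.

    AN: instance-normalize the features.
    AR: add back a gated part of the residual that normalization removed.
    `gate` has shape (channels, 1, 1) with values in [0, 1]; here it is a
    fixed array, whereas a real module would learn it.
    """
    normalized = instance_norm(x)      # AN: variation filtered out
    residual = x - normalized          # information discarded by IN
    relevant = gate * residual         # part treated as emotion-relevant
    irrelevant = (1.0 - gate) * residual
    return normalized + relevant, normalized + irrelevant


# Toy usage with random features standing in for frontend activations.
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(4, 8, 10))
gate = np.full((4, 1, 1), 0.5)
emotion_feat, other_feat = normalize_and_reconstruct(features, gate)
```

With `gate` set to all ones the first output equals the input features, and with all zeros it equals the plain instance-normalized features, so the gate interpolates between keeping and discarding what IN removed.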

Tags:

disentanglement

speech emotion recognition

style transformation

convolutional neural network

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
