Multimodal Transformer With Learnable Frontend and Self Attention for Emotion Recognition

Soumya Dutta, Sriram Ganapathy

Length: 00:09:13

09 May 2022

In this work, we propose a novel approach for multi-modal emotion recognition from conversations using speech and text. The audio representations are learned jointly with a learnable audio front-end (LEAF) model that feeds a CNN-based classifier. The text representations are derived from pre-trained bidirectional encoder representations from transformers (BERT) together with a gated recurrent unit (GRU) network. The textual and audio representations are each processed separately by a bidirectional GRU network with self-attention. Multi-modal information is then extracted by a transformer that takes the utterance-level textual and audio embeddings as input. Experiments on the IEMOCAP database show that the proposed framework improves over the current state-of-the-art results under all the common test settings, primarily because of the improved emotion recognition performance achieved in the audio domain. We also show that the model is more robust to textual errors caused by an automatic speech recognition (ASR) system.
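To make the self-attention step more concrete, here is a minimal sketch of attention pooling, which collapses a sequence of per-frame (or per-token) vectors into a single utterance-level vector. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the attention form, so the scaled dot-product scoring, the fixed `query` vector (which would be a learned parameter in practice), and the function names `softmax` and `attention_pool` are all hypothetical. The GRU outputs are stood in for by plain lists of floats.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Pool a sequence of vectors into one vector using attention weights.

    frames: list of equal-length float lists (assumed stand-in for GRU outputs).
    query:  float list of the same length (assumed learned in a real model).
    Returns (pooled_vector, weights).
    """
    scale = math.sqrt(len(query))
    # Score each frame by its scaled dot product with the query.
    scores = [
        sum(f * q for f, q in zip(frame, query)) / scale
        for frame in frames
    ]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of the frames, one dimension at a time.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, weights = attention_pool(frames, query=[2.0, 0.0])
    print("weights:", weights)
    print("utterance vector:", pooled)
```

In a full model, one such pooled vector per modality and per utterance would be passed on to the fusion transformer; that stage is not sketched here.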

Tags:

self-attention models

transformer networks

multi-modal emotion recognition

learnable front-end

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
