CROSS-MODAL FUSION TECHNIQUES FOR UTTERANCE-LEVEL EMOTION RECOGNITION FROM TEXT AND SPEECH

Jiachen Luo (Queen Mary University of London); Huy Phan (Amazon Alexa); Joshua D. Reiss (Queen Mary University of London)

06 Jun 2023

Multimodal emotion recognition (MER) is a fundamental and complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between modalities. Audio and text are particularly important for humans in understanding emotions. Although many approaches have successfully designed multimodal representations for MER, several challenges remain: 1) bridging the heterogeneity gap between multimodal features and modeling the inter- and intra-modal interactions of multiple modalities; 2) effectively and efficiently modeling the contextual dynamics of the conversation sequence. In this paper, we propose the Cross-Modal RoBERTa (CM-RoBERTa) model for emotion detection from spoken audio and the corresponding transcripts. As the core unit of CM-RoBERTa, parallel self- and cross-attention is designed to dynamically capture the inter- and intra-modal interactions of audio and text. Specifically, mid-level fusion and a residual module are employed to model long-term contextual dependencies and learn modality-specific patterns. We evaluate the approach on the MELD dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
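
The parallel self- and cross-attention idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' CM-RoBERTa implementation: it assumes a single attention head, no learned projections, and text and audio features that already share one feature dimension. The function names (`attention`, `parallel_self_cross_block`, `fuse`), the additive residual, and the mean-pool-and-concatenate fusion are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention (single head, no learned projections).

    query: (n_q, d), key: (n_k, d), value: (n_k, d).
    Returns the attended output (n_q, d) and the weights (n_q, n_k).
    """
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ value, weights


def parallel_self_cross_block(text, audio):
    """Hypothetical block: self- and cross-attention computed side by side.

    Self-attention captures intra-modal interactions; cross-attention lets
    each modality query the other (inter-modal). Both branches are added
    back onto the input as a residual connection (an assumption here).
    """
    text_self, _ = attention(text, text, text)       # intra-modal (text)
    audio_self, _ = attention(audio, audio, audio)   # intra-modal (audio)
    text_cross, _ = attention(text, audio, audio)    # text attends to audio
    audio_cross, _ = attention(audio, text, text)    # audio attends to text
    text_out = text + text_self + text_cross
    audio_out = audio + audio_self + audio_cross
    return text_out, audio_out


def fuse(text_out, audio_out):
    """Assumed fusion: mean-pool each sequence and concatenate the results."""
    return np.concatenate([text_out.mean(axis=0), audio_out.mean(axis=0)])


# Toy inputs: 5 text tokens and 12 audio frames, both with 8 features.
rng = np.random.default_rng(0)
text = rng.standard_normal((5, 8))
audio = rng.standard_normal((12, 8))

t, a = parallel_self_cross_block(text, audio)
fused = fuse(t, a)
print(t.shape, a.shape, fused.shape)  # (5, 8) (12, 8) (16,)
```

Each modality keeps its own sequence length after the block, so the two streams can still be processed separately before the pooled vector is passed to a classifier. A real model would add learned query, key, and value projections, multiple heads, and normalization.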

Tags:

Speech emotion detection and analysis

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
