Audio-Video Fusion with Double Attention for Multimodal Emotion Recognition

Ruxandra Tapu

SPS

Length: 0:15:06

28 Jun 2022

Multimodal emotion recognition has recently become an active research topic in the affective computing community because of its robust performance. In this paper, we propose to analyze emotions in an end-to-end manner using various convolutional neural network (CNN) architectures and attention mechanisms. Specifically, we develop a new framework that integrates spatial and temporal attention into a visual 3D-CNN and temporal attention into an audio 2D-CNN in order to capture intra-modal feature characteristics. The system is further extended with an audio-video cross-attention fusion approach that exploits the relationships between the two modalities. The proposed method achieves an accuracy of 87.89% on the RAVDESS dataset. Compared with state-of-the-art methods, our system demonstrates accuracy gains of more than 1.89%.
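
The abstract does not spell out the fusion step, so the following is only a minimal sketch of what a generic audio-video cross-attention fusion could look like, not the authors' implementation. It assumes each modality has already been reduced to a short sequence of feature vectors of equal dimension, uses plain scaled dot-product attention with no learned projections, and fuses by mean pooling and concatenation. The names `cross_attention`, `mean_pool`, and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one modality
    and keys/values come from the other (assumption: no learned weights)."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query step to every step of the other modality.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the other modality's value vectors.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def mean_pool(sequence):
    """Average a sequence of vectors over time."""
    steps = len(sequence)
    return [sum(step[j] for step in sequence) / steps
            for j in range(len(sequence[0]))]


def fuse(audio_feats, video_feats):
    """Hypothetical fusion: attend in both directions, pool, concatenate."""
    audio_to_video = cross_attention(audio_feats, video_feats, video_feats)
    video_to_audio = cross_attention(video_feats, audio_feats, audio_feats)
    return mean_pool(audio_to_video) + mean_pool(video_to_audio)


# Toy features: 3 audio steps and 2 video steps, each of dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video = [[0.5, 0.5], [1.0, 0.0]]
fused = fuse(audio, video)
```

In a trained system the queries, keys, and values would normally pass through learned linear projections and the fused vector would feed an emotion classifier; both are omitted here to keep the sketch self-contained.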

Tags: IVMSP 2022, June 2022, 2022, IVMSP, IEEE IVMSP 2022, June 26, Nafplio