Length: 00:14:25
10 May 2022

A challenging task in audiovisual emotion recognition is to design neural network architectures that can leverage and fuse multimodal information while temporally aligning modalities, handling missing modalities, and capturing information from all modalities without losing it during training. These requirements are important for model robustness and for increasing accuracy on the emotion recognition task. A recent approach to multimodal fusion is to use the transformer architecture to fuse and align the modalities. This study proposes the AuxFormer framework, which addresses the aforementioned challenges in a principled way. AuxFormer combines the transformer framework with auxiliary networks, using shared losses to infuse information from single-modality networks that are separately embedded. The extra layer of audiovisual information added to the main network retains cues that would otherwise be lost during training. Results show that AuxFormer outperforms state-of-the-art baselines by 6.8% to 7.2% on the CREMA-D corpus and by 2.3% to 3.5% on the MSP-IMPROV corpus, indicating that the framework benefits from the auxiliary networks. We also show that under non-ideal conditions (e.g., missing modalities), the architecture sustains strong performance in audio-only and video-only scenarios, benefiting from the optimized training strategy explored in this study.
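The abstract does not spell out the implementation, but the described idea (a main transformer-based audiovisual network plus auxiliary single-modality networks tied together through shared losses) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration, not taken from the paper: the module names (ModalityEncoder, AuxFormerSketch), layer sizes, mean pooling over time, and the auxiliary loss weights.

import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Transformer encoder over one modality's frame-level features."""
    def __init__(self, feat_dim, model_dim=128, num_layers=2, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        return self.encoder(self.proj(x))      # (batch, time, model_dim)

class AuxFormerSketch(nn.Module):
    """Main audiovisual head plus audio-only and video-only auxiliary heads."""
    def __init__(self, audio_dim, video_dim, num_classes, model_dim=128):
        super().__init__()
        self.audio_enc = ModalityEncoder(audio_dim, model_dim)
        self.video_enc = ModalityEncoder(video_dim, model_dim)
        # Fusion transformer over the two modalities' token sequences.
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.main_head = nn.Linear(model_dim, num_classes)
        self.aux_audio_head = nn.Linear(model_dim, num_classes)
        self.aux_video_head = nn.Linear(model_dim, num_classes)

    def forward(self, audio, video):
        a = self.audio_enc(audio)              # (batch, Ta, model_dim)
        v = self.video_enc(video)              # (batch, Tv, model_dim)
        fused = self.fusion(torch.cat([a, v], dim=1)).mean(dim=1)
        return (self.main_head(fused),                 # audiovisual logits
                self.aux_audio_head(a.mean(dim=1)),    # auxiliary audio logits
                self.aux_video_head(v.mean(dim=1)))    # auxiliary video logits

def shared_loss(main, aux_a, aux_v, target, w_a=0.3, w_v=0.3):
    # Shared loss: auxiliary terms keep single-modality information flowing
    # into training. The weights w_a and w_v are illustrative, not from the paper.
    ce = nn.functional.cross_entropy
    return ce(main, target) + w_a * ce(aux_a, target) + w_v * ce(aux_v, target)

# Example usage with hypothetical feature sizes (e.g., 40-dim acoustic frames,
# 512-dim visual frames, 6 emotion classes):
model = AuxFormerSketch(audio_dim=40, video_dim=512, num_classes=6)
main, aux_a, aux_v = model(torch.randn(8, 100, 40), torch.randn(8, 50, 512))
loss = shared_loss(main, aux_a, aux_v, torch.randint(0, 6, (8,)))

Because the auxiliary heads see only one modality each, a training strategy of this shape can also be paired with modality dropout to mimic the audio-only and video-only test conditions mentioned above; that pairing is likewise an assumption here, not a detail confirmed by the abstract.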
