Learning Cross-modal Audiovisual Representations with Ladder Networks for Emotion Recognition
Lucas Goncalves, Carlos Busso (The University of Texas at Dallas)
SPS
Representation learning is a challenging but essential task in audiovisual learning. A key challenge is to generate strong cross-modal representations while still capturing the discriminative information contained in unimodal features. Properly capturing this information is important for increasing accuracy and robustness in audiovisual tasks. Focusing on emotion recognition, this study proposes novel cross-modal ladder networks that capture modality-specific information while building strong cross-modal representations. Our method uses representations from a backbone network to implement unsupervised auxiliary tasks that reconstruct intermediate-layer representations across the acoustic and visual networks. Skip connections between the cross-modal encoder and decoder provide powerful modality-specific and multimodal representations for emotion recognition. On the CREMA-D corpus, our model achieves precision, recall, and F1 scores above 80% on a six-class problem.
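The abstract does not include code; as a rough illustration of the general idea (not the authors' actual architecture), the NumPy sketch below shows a cross-modal reconstruction auxiliary task: each modality has a small encoder, and a decoder combines the top-level code of one modality with a skip connection from its intermediate layer to reconstruct the other modality's intermediate representation. All dimensions, layer counts, and weight shapes here are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    # simple linear layer with tanh activation (illustrative only)
    return np.tanh(x @ w)

# hypothetical dimensions: 40-d acoustic features, 64-d visual features,
# 32-d shared intermediate/top representations
d_a, d_v, d_h = 40, 64, 32
Wa1, Wa2 = rng.standard_normal((d_a, d_h)), rng.standard_normal((d_h, d_h))
Wv1, Wv2 = rng.standard_normal((d_v, d_h)), rng.standard_normal((d_h, d_h))

def encode(x, W1, W2):
    h1 = layer(x, W1)   # intermediate-layer representation
    h2 = layer(h1, W2)  # top-level representation
    return h1, h2

def decode(z_top, skip, U):
    # ladder-style decoder step: fuse the top code with a skip connection
    # from the encoder before reconstructing a target representation
    return layer(np.concatenate([z_top, skip], axis=-1), U)

# cross-modal decoder weights (acoustic side reconstructing visual h1)
Ua = rng.standard_normal((2 * d_h, d_h))

x_a = rng.standard_normal((8, d_a))  # batch of acoustic features
x_v = rng.standard_normal((8, d_v))  # batch of visual features
ha1, ha2 = encode(x_a, Wa1, Wa2)
hv1, hv2 = encode(x_v, Wv1, Wv2)

# unsupervised auxiliary task: reconstruct the visual intermediate layer
# from the acoustic top code plus its skip connection
hv1_hat = decode(ha2, ha1, Ua)
recon_loss = float(np.mean((hv1_hat - hv1) ** 2))
print(hv1_hat.shape, recon_loss >= 0.0)
```

In training, this reconstruction loss would be minimized jointly with the supervised emotion-classification loss, encouraging the intermediate layers to carry information useful to both modalities; the actual fusion and loss weighting in the paper may differ.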