RECURSIVE JOINT ATTENTION FOR AUDIO-VISUAL FUSION IN REGRESSION BASED EMOTION RECOGNITION
Gnana Praveen Rajasekhar (École de technologie supérieure); Eric Granger (ÉTS Montréal); Patrick Cardinal (École de technologie supérieure)
SPS
In video-based emotion recognition (ER), it is important to effectively leverage the complementary relationship between the audio (A) and visual (V) modalities, while retaining the intra-modal characteristics of each modality. In this paper, we present a recursive joint attention model that includes long short-term memory (LSTM) modules for the fusion of vocal and facial expressions in regression-based ER. Specifically, we investigate the possibility of exploiting the complementary nature of the A and V modalities by applying a joint cross-attention model in a recursive fashion, with LSTMs capturing the temporal dependencies both within each modality and across the joint A-V feature representations. By integrating LSTMs with recursive joint cross-attention, our model can efficiently leverage both intra- and inter-modal relationships for the fusion of the A and V modalities. The results of extensive experiments performed on the challenging Affwild2 and Fatigue (private) datasets indicate that the proposed A-V fusion model can significantly outperform state-of-the-art methods.
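To make the fusion idea concrete, below is a minimal, dependency-free sketch of recursive joint cross-attention between audio and visual feature sequences. This is an illustrative toy, not the paper's implementation: the LSTM stages are omitted, the joint representation is formed by concatenating the two sequences along the time axis so that plain scaled dot-product attention applies (the paper's exact joint formulation may differ), the residual update rule and the number of recursions are assumptions, and all feature values are hypothetical.

```python
import math

def matmul(X, Y):
    # multiply two matrices given as lists of rows
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def softmax_rows(X):
    # numerically stable row-wise softmax
    out = []
    for row in X:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def cross_attend(Q, J):
    # scaled dot-product attention of one modality's features Q (T x d)
    # over the joint audio-visual features J (2T x d)
    d = len(Q[0])
    scores = matmul(Q, transpose(J))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), J)

def recursive_joint_attention(A, V, iterations=2):
    # Recursively refine each modality by attending over the joint A-V
    # features, with a residual connection (an assumed update rule);
    # the paper's LSTM modules are omitted from this sketch.
    for _ in range(iterations):
        J = A + V  # joint features: concatenation along the time axis
        A_att = cross_attend(A, J)
        V_att = cross_attend(V, J)
        A = [[a + c for a, c in zip(ra, rc)] for ra, rc in zip(A, A_att)]
        V = [[v + c for v, c in zip(rv, rc)] for rv, rc in zip(V, V_att)]
    return A, V

# toy features: 2 time steps, 3-dimensional (hypothetical values)
A = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
V = [[0.6, 0.5, 0.4], [0.3, 0.2, 0.1]]
A_out, V_out = recursive_joint_attention(A, V)
print(len(A_out), len(A_out[0]))  # → 2 3 (sequence shapes are preserved)
```

Each recursion lets a modality's features re-query the increasingly refined joint representation, which is the intuition behind applying the joint cross-attention repeatedly rather than once.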