LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition

Fuyan Ma (Hunan University); Bin Sun (Hunan University); Shutao Li (Hunan University)

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

07 Jun 2023

Previous methods for dynamic facial expression recognition (DFER) in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore the long-range dependencies in videos. Transformer-based methods for DFER can achieve better performances but result in higher FLOPs and computational costs. To solve these problems, the local-global spatio-temporal Transformer (LOGO-Former) is proposed to capture discriminative features within each frame and model contextual relationships among frames while balancing the complexity. Based on the priors that facial muscles move locally and facial expressions gradually change, we first restrict both the space attention and the time attention to a local window to capture local interactions among feature tokens. Furthermore, we perform the global attention by querying a token with features from each local window iteratively to obtain long-range information of the whole video sequence. In addition, we propose the compact loss regularization term to further encourage the learned features have the minimum intra-class distance and the maximum inter-class distance. Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39K) indicate that our method provides an effective way to make use of the spatial and temporal dependencies for DFER.

Tags:

Image and video content analysis

LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition

Fuyan Ma (Hunan University); Bin Sun (Hunan University); Shutao Li (Hunan University)

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

ENHANCED GM-PHD FILTER FOR REAL TIME SATELLITE MULTI-TARGET TRACKING

Semi-Federated Learning for Edge Intelligence with Imperfect SIC

IMAGE COMPLETION VIA DUAL-PATH COOPERATIVE FILTERING

Join the IEEE Signal Processing Society