LOCAL CONTEXT AND DIMENSIONAL RELATION AWARE TRANSFORMER NETWORK FOR CONTINUOUS AFFECT ESTIMATION
Shuo Yang, Yongtang Bao, Yue Qi
In recent years, video-based continuous affect estimation has received increasing attention in computer vision, making it crucial to model the temporal dynamics of facial expression changes robustly and accurately. We therefore propose a transformer network that incorporates both local context and dimensional correlation to model visual information efficiently. Because the transformer's self-attention layer is insensitive to local context, noise such as momentary head-pose and lighting changes may degrade the model's performance. We thus adopt a local-wise transformer encoder to strengthen the transformer's ability to capture local contextual information. In addition, drawing on prior knowledge of the correlation between valence and arousal, we design a va-relevance bootstrap module and a corresponding valence-arousal relevance loss (va loss). Experiments on the Aff-Wild2 and AFEW-VA datasets demonstrate the superior performance of our method for continuous affect estimation.
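The abstract does not specify implementation details, but a minimal PyTorch sketch can illustrate the two ideas it names: restricting self-attention to a local temporal window, and penalising predicted valence-arousal correlation against a prior. All names (LocalSelfAttention, va_relevance_loss), the masking scheme, the window size, and the correlation formulation are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn


class LocalSelfAttention(nn.Module):
    """Self-attention restricted to a temporal window around each frame.

    Hypothetical sketch of a local-wise encoder layer: frames farther
    apart than `window` steps are masked out of the attention map.
    """

    def __init__(self, dim: int, num_heads: int = 4, window: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        # True entries are disallowed: pairs more than `window` frames apart.
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out


def va_relevance_loss(valence: torch.Tensor, arousal: torch.Tensor,
                      target_corr: float = 0.0) -> torch.Tensor:
    """One plausible form of a valence-arousal relevance loss (assumed):
    penalise deviation of the predicted VA correlation from a prior value."""
    v = valence - valence.mean()
    a = arousal - arousal.mean()
    corr = (v * a).sum() / (v.norm() * a.norm() + 1e-8)
    return (corr - target_corr) ** 2
```

Masking the attention map, rather than chunking the sequence, keeps the layer drop-in compatible with a standard transformer encoder while still suppressing long-range pairs dominated by transient noise.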