Lecture 10 Oct 2023

In recent years, video-based continuous affect estimation has received growing attention in computer vision, making it crucial to model the temporal information in facial expression changes robustly and accurately. We propose a transformer network that incorporates both local context and dimensional correlation to model visual information efficiently. Because the transformer's self-attention layer is insensitive to local context, noise such as instantaneous head pose and lighting changes can degrade the model's performance; we therefore adopt a local-wise transformer encoder to enhance the transformer's ability to capture local contextual information. In addition, drawing on prior knowledge of the correlation between valence and arousal, we design a VA-relevance bootstrap module and a corresponding valence-arousal relevance loss (VA loss). Experiments on the Aff-Wild2 and AFEW-VA datasets show the superior performance of our method for continuous affect estimation.
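The local-wise encoder can be read as self-attention restricted to a temporal neighborhood of each frame. The following PyTorch sketch illustrates that idea under the assumption of a fixed window size and a mask-based implementation; the class name LocalSelfAttention, the window parameter, and the masking scheme are illustrative, not the paper's actual module.

import torch
import torch.nn as nn

class LocalSelfAttention(nn.Module):
    """Self-attention restricted to a temporal window, so each frame
    attends only to nearby frames and local context is preserved."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) sequence of per-frame features
        seq_len = x.size(1)
        idx = torch.arange(seq_len, device=x.device)
        # Boolean mask: True blocks attention between frames farther
        # apart than the window, keeping the receptive field local.
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

The VA loss is described only as exploiting the valence-arousal correlation. One plausible form, sketched below under that assumption, combines per-dimension concordance correlation coefficient (CCC) losses, which are standard for Aff-Wild2, with a term that pulls the predicted valence-arousal correlation toward the correlation observed in the labels; the weight lam and the exact relevance term are hypothetical.

def ccc(pred: torch.Tensor, gold: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Concordance correlation coefficient of two 1-D sequences."""
    pm, gm = pred.mean(), gold.mean()
    pv, gv = pred.var(unbiased=False), gold.var(unbiased=False)
    cov = ((pred - pm) * (gold - gm)).mean()
    return 2 * cov / (pv + gv + (pm - gm) ** 2 + eps)

def pearson(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pearson correlation of two 1-D sequences."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).mean() / (x.std(unbiased=False) * y.std(unbiased=False) + eps)

def va_loss(v_pred, a_pred, v_gold, a_gold, lam: float = 0.5) -> torch.Tensor:
    # Per-dimension CCC losses for valence and arousal.
    l_ccc = (1 - ccc(v_pred, v_gold)) + (1 - ccc(a_pred, a_gold))
    # Relevance term (assumed form): match the predicted V-A correlation
    # to the correlation present in the ground-truth labels.
    l_rel = (pearson(v_pred, a_pred) - pearson(v_gold, a_gold)).abs()
    return l_ccc + lam * l_rel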
