Background Disturbance Mitigation for Video Captioning via Entity-Action Relocation

Zipeng Li (Wuhan University of Technology); Xian Zhong (Wuhan University of Technology); Shuqin Chen (Hubei University of Education); Wenxuan Liu (Wuhan University of Technology); Wenxin Huang (Hubei University); Lin Li (Wuhan University of Technology)

08 Jun 2023

Video captioning aims to generate sentences that accurately describe video content, with the video background serving as contextual cues. State-of-the-art methods tend to exploit richer video representations and fuse them with language to improve caption quality, with great success. However, they focus on foreground semantics and ignore the potential negative impact of background disturbance on caption generation, i.e., entities and actions can be misjudged when videos share similar backgrounds. To ameliorate this issue, we propose Entity-Action Relocation (EAR), which enhances the adaptability of entities and actions to varied backgrounds by relocating them onto new ones. Specifically, from an extracted original video feature, we construct a mixed background for all entities and actions, forming a distracting video feature sample. Contrastive learning then pulls the caption generated from the original representation closer to the caption generated from the distracting representation, while pushing the former away from captions generated for other videos, explicitly concentrating on the entities and actions of the current scene. Extensive experiments on two public datasets (MSR-VTT and MSVD) demonstrate that mitigating background disturbance yields competitive caption generation.
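The contrastive objective described above can be sketched as an InfoNCE-style loss: the caption representation of the original video is the anchor, the caption representation of the background-mixed (distracting) sample is the positive, and captions of other videos are negatives. This is a minimal illustrative sketch, not the authors' implementation; the function name, temperature value, and cosine-similarity choice are assumptions.

```python
import numpy as np

def ear_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull the anchor (original-video caption
    representation) toward the positive (distracting-sample caption
    representation) and push it away from negatives (other videos'
    caption representations). All inputs are 1-D feature vectors."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = np.exp(cos(anchor, positive) / temperature)
    neg = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    # Loss is small when the positive dominates the similarity mass.
    return float(-np.log(pos / (pos + neg)))
```

A lower loss indicates the model describes the same entities and actions consistently regardless of the background, which is the behavior EAR encourages.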
