Poster 10 Oct 2023

Video captioning, which aims to describe the content of a video in natural language, is fundamental to visual understanding. Previous works devote great effort to visual representation learning by comparing generated sentences with the ground truth in a supervised way. However, they explore linguistic semantics inadequately because visual words are insufficiently learned. In this paper, we propose the Semantic Learning Network (SLN), which explicitly learns the specific semantics of all visual words by aggregating static and dynamic features. In addition, to achieve controllable video captioning and to alleviate the problem that one video is mapped to multiple annotations during training, we propose a predicate-based feature selection approach that converts the video into different textual features under the guidance of the predicate in each caption. Experimental results on the MSVD and MSR-VTT datasets show that our SLN outperforms recent state-of-the-art methods.
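
The abstract does not give implementation details of the feature aggregation; the sketch below is a minimal illustration only, assuming an appearance (static) stream and a motion (dynamic) stream fused into a multi-label predictor over a visual-word vocabulary. All module names, dimensions, and the additive fusion are hypothetical assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SemanticAggregator(nn.Module):
    # Hypothetical sketch: fuse static (appearance) and dynamic (motion)
    # features and predict visual-word probabilities as a multi-label task.
    def __init__(self, static_dim=2048, dynamic_dim=1024, hidden_dim=512, vocab_size=300):
        super().__init__()
        self.static_proj = nn.Linear(static_dim, hidden_dim)
        self.dynamic_proj = nn.Linear(dynamic_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, static_feats, dynamic_feats):
        # static_feats:  (batch, frames, static_dim)  -- per-frame appearance features
        # dynamic_feats: (batch, clips,  dynamic_dim) -- per-clip motion features
        s = self.static_proj(static_feats).mean(dim=1)   # temporal average pooling
        d = self.dynamic_proj(dynamic_feats).mean(dim=1)
        fused = torch.relu(s + d)                        # simple additive fusion (assumption)
        return torch.sigmoid(self.classifier(fused))     # per visual-word probabilities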