  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
Lecture 09 Oct 2023

In this paper, a global-local contrastive learning framework is proposed to leverage global contextual information from different modalities and effectively fuse them under the supervision of contrastive learning. First, a global-local encoder is proposed to fully exploit the salient contextual information in each modality, producing global contextual representations. Second, contrastive learning is used to minimize the semantic distance between paired modalities, improving the content matching between videos and their predicted captions. Finally, an attention-based multimodal encoder is presented to effectively fuse the modalities, generating multimodal representations that incorporate global contextual information from each. Extensive experiments on benchmark datasets show that the proposed method outperforms state-of-the-art approaches.
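The abstract does not give the exact form of the contrastive objective, but "minimizing the semantic distance between paired modalities" is commonly realized as a symmetric InfoNCE loss over batched video/caption embeddings. The sketch below is an assumption-labeled illustration, not the paper's implementation; the function name, temperature value, and embedding shapes are hypothetical.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE contrastive loss over a batch of paired
    video/caption embeddings of shape [batch, dim].  Matched pairs
    sit on the diagonal of the similarity matrix; the loss pulls
    them together and pushes mismatched pairs apart."""
    # L2-normalize so dot products become cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # [batch, batch] similarity matrix
    labels = np.arange(len(logits))         # i-th video pairs with i-th caption

    def xent(l):
        # cross-entropy of the diagonal (matched) entries, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With this formulation, well-aligned modality pairs yield a loss near zero, while randomly paired embeddings yield a loss near log(batch size), which is the intuition behind using it to supervise the fusion of global contextual representations.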
