  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
Lecture 09 Oct 2023

In this paper, a global-local contrastive learning framework is proposed to leverage global contextual information from different modalities and effectively fuse them under the supervision of contrastive learning. First, a global-local encoder is proposed to fully exploit the salient contextual information in each modality, producing global contextual representations. Second, contrastive learning is used to minimize the semantic distance between paired modalities, improving the content matching between videos and their predicted captions. Finally, an attention-based multimodal encoder is presented to effectively fuse the modalities, generating multimodal representations that incorporate global contextual information from each. Extensive experiments on benchmark datasets show that the proposed method outperforms state-of-the-art approaches.
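The abstract does not give the exact form of the contrastive objective, but "minimizing the semantic distance between paired modalities" is commonly realized as a symmetric InfoNCE loss over batched video/caption embeddings. The sketch below is an assumption-labeled illustration, not the paper's implementation; the function name, temperature value, and embedding shapes are hypothetical.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE contrastive loss over a batch of paired
    video/caption embeddings of shape [batch, dim].  Matched pairs
    sit on the diagonal of the similarity matrix; the loss pulls
    them together and pushes mismatched pairs apart."""
    # L2-normalize so dot products become cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # [batch, batch] similarity matrix
    labels = np.arange(len(logits))         # i-th video pairs with i-th caption

    def xent(l):
        # cross-entropy of the diagonal (matched) entries, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With this formulation, well-aligned modality pairs yield a loss near zero, while randomly paired embeddings yield a loss near log(batch size), which is the intuition behind using it to supervise the fusion of global contextual representations.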
