A GLOBAL-LOCAL CONTRASTIVE LEARNING FRAMEWORK FOR VIDEO CAPTIONING
Qunyue Huang, Bin Fang, Xi Ai
In this paper, a global-local contrastive learning framework is proposed to leverage global contextual information from different modalities and to fuse these modalities effectively under the supervision of contrastive learning. First, a global-local encoder is proposed to sufficiently explore the salient contextual information in each modality, producing global contextual representations. Second, contrastive learning is employed to minimize the semantic distance between paired modalities, improving the content matching between videos and their predicted captions. Finally, an attention-based multimodal encoder is presented to fuse the different modalities effectively, generating multimodal representations that incorporate global contextual information from all modalities. Extensive experiments on benchmark datasets show that the proposed method outperforms state-of-the-art approaches.
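To make the two learning signals concrete, the sketch below illustrates a symmetric InfoNCE-style contrastive objective between paired video and caption embeddings, together with a cross-attention block standing in for the attention-based multimodal encoder. This is a minimal illustration in PyTorch under assumed dimensions and hyperparameters (embedding size 512, temperature 0.07), not the paper's exact architecture; the encoder internals and training details are not reproduced here.

```python
# Minimal sketch of the contrastive objective and attention-based fusion.
# Assumptions (not from the paper): embedding dim 512, temperature 0.07,
# mean-pooled global embeddings for the contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: pulls paired video/caption embeddings
    together and pushes mismatched pairs within the batch apart."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs lie on the diagonal; score each row and column
    # as a classification over the batch.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

class AttentionFusion(nn.Module):
    """Cross-attention block that fuses one modality with another,
    standing in for the attention-based multimodal encoder."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # Queries attend over the other modality's features.
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)  # residual connection + norm

# Usage on random features: batch of 4, 16 video frames, 12 caption tokens.
video = torch.randn(4, 16, 512)
text = torch.randn(4, 12, 512)
fusion = AttentionFusion()
multimodal = fusion(video, text)                      # (4, 16, 512)
loss = contrastive_loss(video.mean(1), text.mean(1))  # global embeddings
```

In this sketch the contrastive loss operates on mean-pooled global embeddings, while the fusion module operates on token-level features; any alignment of these choices with the paper's global-local encoder is an assumption of the illustration.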