Enhancing Multimodal Alignment with Momentum Augmentation for Dense Video Captioning
Yiwei Wei (Tianjin University); Shaozu Yuan (JD AI); Meng Chen (JD AI); Longbiao Wang (Tianjin University)
SPS
Dense video captioning aims to localize multiple events in an untrimmed video and generate a caption for each event. Fusing different modalities (e.g., RGB, flow, audio) via a transformer structure is a promising way to improve captioning performance. However, it is challenging for the cross-modal encoder to learn multimodal interactions because the modalities follow inherently different distributions. In this paper, we propose a novel transformer structure that uses contrastive learning to align the different modalities. Specifically, to avoid the limitations of small batch sizes and false contrastive targets, we design an event-aligned momentum augmentation strategy that applies contrastive learning to dense video captioning. Experimental results show that our approach outperforms all existing multimodal fusion methods for dense video captioning.
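As a rough illustration of the general idea behind momentum-based contrastive alignment, the sketch below shows a generic momentum (exponential moving average) parameter update and an InfoNCE loss in plain Python. It is not the event-aligned strategy of the paper, whose details the abstract does not give. The function names (`momentum_update`, `info_nce`), the momentum coefficient, and the temperature are all illustrative assumptions.

```python
import math


def momentum_update(query_params, key_params, m=0.999):
    """Hypothetical helper: EMA update, key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def _normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query: -log softmax score of the positive key.

    Assumed setup: `query` is a feature from one modality (e.g. RGB),
    `positive_key` is the momentum-encoded feature of another modality
    for the same event, and `negative_keys` come from other events.
    """
    q = _normalize(query)
    logits = [_dot(q, _normalize(positive_key)) / temperature]
    logits += [_dot(q, _normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[0]
```

In a setup like this, negatives drawn from a slowly updated momentum encoder can be accumulated across steps, which is one common way to work around a small batch size; the loss is small when the query matches its positive key and large when it matches a negative instead.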