Trust Your Partner's Friends: Hierarchical Cross-modal Contrastive Pre-training for Video-Text Retrieval
Yuhan Xiang (Xiamen University); Kaijian Liu (SenseTime Group Limited); Shixiang Tang (The University of Sydney); Lei Bai (Shanghai AI Laboratory); Feng Zhu (University of Science and Technology of China); Rui Zhao (SenseTime Group Limited); Xianming Lin (Xiamen University)
Video-text retrieval has benefited greatly from massive web video in recent years, yet its performance is still limited by the weak supervision provided by uncurated data. In this work, we propose to leverage the well-learned representations of each original modality and exploit complementary information across the two views of the same video, i.e., video clips and captions, by using one view to mine positive samples from the neighboring samples of the other. Respecting the hierarchical organization of real-world data, we further design a hierarchical cross-modal pre-training method (HCP) that learns good representations in the common embedding space. We evaluate the pre-trained model on three downstream tasks, i.e., text-to-video retrieval, action step localization, and video question answering, and our method outperforms previous works under the same setting.
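The core idea of mining positives through the other modality ("trust your partner's friends") can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the neighbor count `k`, and the multi-positive InfoNCE-style loss are all assumptions for the sake of a runnable example.

```python
import numpy as np

def cross_modal_neighbor_positives(video_emb, text_emb, k=2):
    """For each video, treat the videos paired with its caption's k nearest
    captions as extra positives (a sketch of cross-modal neighbor mining)."""
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # caption-to-caption similarity; exclude each caption itself
    sim_tt = t @ t.T
    np.fill_diagonal(sim_tt, -np.inf)
    # indices of the k most similar captions per sample
    nbrs = np.argsort(-sim_tt, axis=1)[:, :k]
    # positive mask: the paired video plus videos of neighboring captions
    n = len(video_emb)
    pos = np.eye(n, dtype=bool)
    rows = np.repeat(np.arange(n), k)
    pos[rows, nbrs.ravel()] = True
    return pos

def multi_positive_contrastive_loss(video_emb, text_emb, pos_mask, tau=0.07):
    """InfoNCE-style loss averaged over all positives of each video anchor
    (an assumed loss form, not necessarily the one used in HCP)."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / tau
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_anchor = -(log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return per_anchor.mean()
```

In use, the mask from the first function feeds the loss directly: videos whose captions are close in the text space pull toward each other's captions as well as their own, which is how one modality's neighborhood structure supervises the other.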