3D-CSL: SELF-SUPERVISED 3D CONTEXT SIMILARITY LEARNING FOR NEAR-DUPLICATE VIDEO RETRIEVAL
Rui Deng, Qian Wu, Yuke Li
In this paper, we introduce 3D-CSL, a compact pipeline for Near-Duplicate Video Retrieval (NDVR), and explore a novel self-supervised learning strategy for video similarity learning. Most previous NDVR methods rely heavily on pairwise labeled data; they are therefore limited by the scale of available datasets and cannot optimize complex but efficient backbones, e.g., 3D transformers. To break this limitation, we explore self-supervised similarity learning for the NDVR task and propose FCS loss, a novel triplet loss, and ShotMix, a novel video-specific augmentation, both of which significantly enhance self-supervised video similarity learning. On this basis, the compact 3D pipeline we propose shows a great advantage in extracting global spatiotemporal dependencies in videos and achieves the best balance between efficiency and effectiveness. Furthermore, we propose PredMAE, which pretrains the 3D transformer with video prediction as a pretext task to boost the downstream NDVR task without any human labels. Experiments on FIVR-200K and CC_WEB_VIDEO demonstrate the superiority and reliability of our method, which achieves state-of-the-art performance on clip-level NDVR. Code is released at https://github.com/dun-research/3D-CSL.