Rethink pair-wise self-supervised cross-modal retrieval from a contrastive learning perspective
Tiantian Gong (Nanjing University of Aeronautics and Astronautics); Junsheng Wang (Nanjing University of Science And Technology); Liyan Zhang (Nanjing University of Aeronautics and Astronautics)
Cross-modal retrieval faces the challenges of eliminating the modality gap and learning representations that are both robustly modality-invariant and semantically discriminative. Existing self-supervised cross-modal approaches still suffer from faulty negative-sample selection strategies and lack reliable high-level semantic guidance. We therefore propose a robust self-supervised co-training method for instance and semantic discrimination learning (RCL) in cross-modal retrieval. Specifically, we use k-reciprocal nearest neighbors to generate pairwise pseudo-labels, which allows us to select negative samples correctly, filter out false negatives, and pull semantically similar instances closer, in a manner analogous to supervised contrastive learning. In addition, we employ prototype contrastive learning to learn high-level semantically discriminative representations across semantic groups, pulling instances and their prototype vectors closer so as to better capture the semantic structure of multimodal data. Extensive experiments on cross-modal datasets demonstrate the effectiveness of our method.
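To make the two components of the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released code: a k-reciprocal nearest-neighbor mask that marks pseudo-positive pairs so that semantically similar instances are pulled together rather than pushed apart as false negatives, plus a prototype-level InfoNCE term that pulls each instance toward its assigned semantic prototype. The function names, the temperature tau, and the neighborhood size k are illustrative assumptions, not details taken from the paper.

    # Assumed sketch of the two losses described above; names and
    # hyperparameters (k, tau) are illustrative, not from the paper.
    import torch
    import torch.nn.functional as F

    def k_reciprocal_pseudo_labels(img_emb, txt_emb, k=10):
        """Mark (i, j) as a pseudo-positive pair when image i and text j
        appear in each other's top-k neighbor lists (k-reciprocal NN)."""
        sim = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).t()
        topk_txt = sim.topk(k, dim=1).indices        # top-k texts per image
        topk_img = sim.t().topk(k, dim=1).indices    # top-k images per text
        n = sim.size(0)
        in_i = torch.zeros(n, n, dtype=torch.bool).scatter_(1, topk_txt, True)
        in_j = torch.zeros(n, n, dtype=torch.bool).scatter_(1, topk_img, True)
        mask = in_i & in_j.t()                       # reciprocal agreement
        mask.fill_diagonal_(True)                    # paired (i, i) is a true positive
        return mask

    def supervised_contrastive_loss(anchor, candidates, pos_mask, tau=0.07):
        """SupCon-style loss: every pseudo-positive in the mask contributes
        to the numerator, so it is no longer treated as a negative."""
        logits = F.normalize(anchor, dim=1) @ F.normalize(candidates, dim=1).t() / tau
        log_prob = logits - logits.logsumexp(dim=1, keepdim=True)
        pos = pos_mask.float()
        return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

    def prototype_contrastive_loss(emb, prototypes, assignments, tau=0.07):
        """Pull each instance toward its assigned cluster prototype via
        InfoNCE over the prototype vectors."""
        logits = F.normalize(emb, dim=1) @ F.normalize(prototypes, dim=1).t() / tau
        return F.cross_entropy(logits, assignments)

Under these assumptions, a training step would compute the mask from the current image and text embeddings, apply the supervised contrastive loss across modalities with that mask, and add the prototype loss using cluster assignments (e.g., from k-means over the joint embedding space) as the semantic groups.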