
Rethink pair-wise self-supervised cross-modal retrieval from a contrastive learning perspective

Tiantian Gong (Nanjing University of Aeronautics and Astronautics); Junsheng Wang (Nanjing University of Science and Technology); Liyan Zhang (Nanjing University of Aeronautics and Astronautics)

06 Jun 2023

Cross-modal retrieval faces the challenges of eliminating the modality gap and learning representations that are both robustly modality-invariant and semantically discriminative. Existing self-supervised cross-modal approaches still suffer from faulty negative-sample selection strategies and a lack of reliable high-level semantic guidance. We therefore propose a robust self-supervised method (RCL) that co-trains instance discrimination and semantic discrimination for cross-modal retrieval. Specifically, by using k-reciprocal nearest neighbors to generate pairwise pseudo-labels, we select negative samples correctly and better filter out false negatives, pulling semantically similar instances closer in a manner akin to supervised contrastive learning. In addition, we use prototype contrastive learning to learn high-level semantically discriminative representations across semantic groups, pulling instances and their prototype vectors closer to better capture the semantic structure of multimodal data. Extensive experiments demonstrate the effectiveness of our method on cross-modal datasets.
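As a rough illustration of the two ingredients the abstract describes, the PyTorch sketch below shows (a) k-reciprocal pseudo-positive generation with a supervised-contrastive-style loss over the resulting positive sets, and (b) a generic prototype contrastive term. This is not the authors' code: the neighborhood size k, the temperature tau, the diagonal-inclusion rule, and the ProtoNCE-style formulation of the prototype loss are assumptions filled in from the abstract's description.

```python
import torch
import torch.nn.functional as F

def k_reciprocal_pseudo_positives(img_emb, txt_emb, k=10):
    """Mark cross-modal pairs (i, j) as pseudo-positive when each is
    among the other's top-k nearest neighbours (the k-reciprocal rule).
    Returns the similarity matrix and a boolean [N, N] positive mask."""
    sim = F.normalize(img_emb, dim=1) @ F.normalize(txt_emb, dim=1).T  # [N, N]
    n = sim.size(0)
    # mask_i2t[i, j]: text j is among image i's top-k neighbours.
    mask_i2t = torch.zeros_like(sim, dtype=torch.bool)
    mask_i2t.scatter_(1, sim.topk(k, dim=1).indices, True)
    # mask_t2i[i, j]: image i is among text j's top-k neighbours.
    mask_t2i = torch.zeros_like(sim, dtype=torch.bool)
    mask_t2i.scatter_(0, sim.topk(k, dim=0).indices, True)
    # Keep (i, j) only when both directions agree; always keep the
    # original ground-truth pairing on the diagonal.
    pos = (mask_i2t & mask_t2i) | torch.eye(n, dtype=torch.bool, device=sim.device)
    return sim, pos

def masked_contrastive_loss(sim, pos_mask, tau=0.07):
    """Supervised-contrastive-style loss: every pseudo-positive in a row
    is pulled together, so filtered false negatives are no longer pushed
    away as they would be in plain InfoNCE."""
    log_prob = (sim / tau) - (sim / tau).logsumexp(dim=1, keepdim=True)
    loss = -(log_prob * pos_mask.float()).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

def prototype_contrastive_loss(emb, prototypes, assignments, tau=0.07):
    """ProtoNCE-style term: classify each instance into its own cluster
    prototype (e.g., from k-means over the embeddings), pulling the
    instance toward its prototype and away from the others."""
    logits = F.normalize(emb, dim=1) @ F.normalize(prototypes, dim=1).T / tau
    return F.cross_entropy(logits, assignments)
```

A training step under these assumptions would simply sum the two terms, e.g. `sim, pos = k_reciprocal_pseudo_positives(v, t); loss = masked_contrastive_loss(sim, pos) + prototype_contrastive_loss(v, protos, assign)`, with prototypes periodically refreshed by clustering.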
