Boosting Fine-grained Sketch-based Image Retrieval with Self-supervised Learning

Zhaolong Zhang (Fudan University); Yangdong Chen (Fudan University); Yuejie Zhang (Fudan University); Rui Feng (Fudan University); Tao Zhang (Shanghai University of Finance and Economics)

06 Jun 2023

Fine-grained sketch-based image retrieval (FG-SBIR) aims to align images and sketches at the instance level. The task is challenging because sketches and images differ significantly. Existing methods often underperform because large-scale fine-grained image-sketch datasets are lacking and because they depend heavily on classification models pre-trained on ImageNet. In this paper, we propose a self-supervised pre-trained FG-SBIR model that does not depend on large-scale annotated datasets. Only images and their corresponding edge maps are used at the pre-training stage. A mixed modal transformation is designed to generate different mixed-up views. The FG-SBIR model is pre-trained by minimizing the distance between views of the same instance and then fine-tuned with a simple triplet loss. With a plain downstream network, it achieves generally better performance than state-of-the-art models on three widely used FG-SBIR datasets.
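
The two training objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the mixed modal transformation, the distance measure, or the triplet margin, so everything below is an assumption made for illustration: `mix_views` uses a simple pixel-wise blend of an image and its edge map as a stand-in for the mixed modal transformation, `pretrain_loss` uses squared Euclidean distance, `triplet_loss` uses a hypothetical margin of 0.2, and the encoder network is omitted so that flat lists of floats stand in for embeddings.

```python
import math


def mix_views(image, edge_map, lam):
    """Blend an image with its edge map pixel-wise.

    Assumption: a mixup-style stand-in for the paper's mixed modal
    transformation, which the abstract does not define.
    """
    if len(image) != len(edge_map):
        raise ValueError("image and edge map must have the same size")
    return [lam * p + (1.0 - lam) * e for p, e in zip(image, edge_map)]


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pretrain_loss(view_a, view_b):
    """Squared distance between two views of the same instance.

    Assumption: the abstract only says the distance is minimized,
    not which distance is used.
    """
    return l2_distance(view_a, view_b) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss; the margin value is hypothetical."""
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)


if __name__ == "__main__":
    image = [0.2, 0.8, 0.5, 0.1]   # toy flattened image
    edges = [0.0, 1.0, 0.0, 1.0]   # toy flattened edge map
    view_1 = mix_views(image, edges, lam=0.7)
    view_2 = mix_views(image, edges, lam=0.3)
    print("pre-training loss:", pretrain_loss(view_1, view_2))
    print("triplet loss:", triplet_loss(edges, image, [0.9, 0.1, 0.9, 0.1]))
```

In a real pipeline both losses would be computed on the outputs of a trainable encoder rather than on raw pixel lists; the sketch only shows the shape of the two objectives.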

Tags:

Multi-modal signal processing and analysis (audio/visual/haptics/radar/lidar etc.)

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
