CLIP-FG: SELECTING DISCRIMINATIVE IMAGE PATCHES BY CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING FOR FINE-GRAINED IMAGE CLASSIFICATION
Min Yuan, Ningning Lv, Yufei Xie, Fuxiang Lu, Kun Zhan
SPS
Fine-Grained Visual Classification (FGVC), which aims to identify objects from subcategories, is challenging due to large intra-class differences and subtle inter-class differences. To address these issues, this paper proposes a patch selection model based on CLIP for fine-grained visual classification, named CLIP-FG. Specifically, unlike the original CLIP, which operates only at the level of whole images and text, we compute the similarity between class labels and individual image patches. The top-k most similar patches are selected, and their indices are fed into a Vision Transformer to focus on discriminative regions and improve fine-grained classification performance. Quantitative evaluations show that CLIP-FG achieves competitive performance against mainstream methods.
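The label–patch selection step described above can be illustrated with a minimal sketch. It assumes patch embeddings and label-text embeddings have already been produced by CLIP's image and text encoders; the function name, array shapes, and scoring rule (each patch scored by its best-matching label) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def select_topk_patches(patch_embs: np.ndarray, label_embs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k patches most similar to any class label.

    patch_embs: (num_patches, dim) patch embeddings (assumed from CLIP's image encoder)
    label_embs: (num_labels, dim) label-text embeddings (assumed from CLIP's text encoder)
    """
    # L2-normalize so the dot product equals cosine similarity, as in CLIP.
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    t = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)

    sim = p @ t.T                     # (num_patches, num_labels) cosine similarities
    scores = sim.max(axis=1)          # score each patch by its best-matching label
    topk = np.argsort(scores)[::-1][:k]
    return np.sort(topk)              # sorted indices, ready to index ViT patch tokens

# Toy example: patches 0 and 2 point roughly along the single label direction.
patches = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
labels = np.array([[1.0, 0.0, 0.0, 0.0]])
print(select_topk_patches(patches, labels, k=2))  # → [0 2]
```

The selected indices would then be used to gather the corresponding patch tokens before the Vision Transformer's later layers, so that attention concentrates on discriminative regions.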