CLIP-FG: SELECTING DISCRIMINATIVE IMAGE PATCHES BY CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING FOR FINE-GRAINED IMAGE CLASSIFICATION
Min Yuan, Ningning Lv, Yufei Xie, Fuxiang Lu, Kun Zhan
SPS
Fine-Grained Visual Classification (FGVC), which aims to identify objects from subcategories, is challenging due to large intra-class differences and subtle inter-class differences. To address these issues, this paper proposes a patch selection model based on CLIP for fine-grained visual classification, named CLIP-FG. Specifically, unlike the original CLIP, which operates only at the level of whole images and text, we compute the similarity between class labels and individual image patches. The top-k most similar patches are selected, and their indices are fed into a Vision Transformer to focus on discriminative regions and improve fine-grained classification performance. Quantitative evaluations show that CLIP-FG achieves competitive performance against mainstream methods.
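The label–patch selection step described above can be illustrated with a minimal sketch. It assumes patch embeddings and label-text embeddings have already been produced by CLIP's image and text encoders; the function name, array shapes, and scoring rule (each patch scored by its best-matching label) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def select_topk_patches(patch_embs: np.ndarray, label_embs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k patches most similar to any class label.

    patch_embs: (num_patches, dim) patch embeddings (assumed from CLIP's image encoder)
    label_embs: (num_labels, dim) label-text embeddings (assumed from CLIP's text encoder)
    """
    # L2-normalize so the dot product equals cosine similarity, as in CLIP.
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    t = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)

    sim = p @ t.T                     # (num_patches, num_labels) cosine similarities
    scores = sim.max(axis=1)          # score each patch by its best-matching label
    topk = np.argsort(scores)[::-1][:k]
    return np.sort(topk)              # sorted indices, ready to index ViT patch tokens

# Toy example: patches 0 and 2 point roughly along the single label direction.
patches = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.9, 0.1, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
labels = np.array([[1.0, 0.0, 0.0, 0.0]])
print(select_topk_patches(patches, labels, k=2))  # → [0 2]
```

The selected indices would then be used to gather the corresponding patch tokens before the Vision Transformer's later layers, so that attention concentrates on discriminative regions.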