ATTENTION-GUIDED CONTRASTIVE MASKED IMAGE MODELING FOR TRANSFORMER-BASED SELF-SUPERVISED LEARNING
Yucheng Zhan, Yucheng Zhao, Chong Luo, Yueyi Zhang, Xiaoyan Sun
Self-supervised learning with vision transformers (ViT) has gained much attention recently. Most existing methods rely on either contrastive learning or masked image modeling. The former is well suited to global feature extraction but underperforms on fine-grained tasks; the latter exploits the internal structure of images but ignores their high information sparsity and unbalanced information distribution. In this paper, we propose a new approach called Attention-guided Contrastive Masked Image Modeling (ACoMIM), which integrates the merits of both paradigms and leverages the attention mechanism of ViT for effective representation learning. Specifically, it has two pretext tasks: predicting the features of masked regions under the guidance of attention, and comparing the global features of masked and unmasked images. We show that these two pretext tasks complement each other and improve our method's performance. Experiments demonstrate that our model transfers well to various downstream tasks such as classification and object detection.
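To make the two pretext tasks concrete, the sketch below combines a masked feature-prediction loss with an InfoNCE contrastive loss between the global features of a masked view and an unmasked view. It is a minimal illustration, not the authors' implementation: the `student`/`teacher` encoders, their call signatures, and the plain selection of masked positions (in place of the paper's attention guidance) are all assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

def acomim_style_losses(student, teacher, images, mask, temperature=0.1):
    """Illustrative combination of the two pretext tasks described above.

    Assumptions (hypothetical, not from the paper):
      - `student` and `teacher` are ViT encoders returning
        (patch_tokens [B, N, D], cls_token [B, D]);
      - `teacher` is a frozen or EMA copy of the student seeing the full image;
      - `mask` is a boolean tensor [B, N] marking masked patch positions.
    """
    # Teacher encodes the unmasked image; its patch features are the
    # regression targets and its CLS token is the unmasked global feature.
    with torch.no_grad():
        target_patches, target_cls = teacher(images)

    # Student encodes the masked image (masked patches dropped or replaced).
    pred_patches, pred_cls = student(images, mask=mask)

    # (1) Masked feature prediction: regress teacher features at masked
    # positions. Attention guidance would weight or select these positions
    # by the teacher's attention maps; a uniform mean is used here for brevity.
    pred_loss = F.mse_loss(pred_patches[mask], target_patches[mask])

    # (2) Contrastive loss between global (CLS) features of the masked and
    # unmasked views, with other samples in the batch as negatives (InfoNCE).
    z1 = F.normalize(pred_cls, dim=-1)
    z2 = F.normalize(target_cls, dim=-1)
    logits = z1 @ z2.t() / temperature                     # [B, B] similarities
    labels = torch.arange(z1.size(0), device=z1.device)    # positives on diagonal
    contrast_loss = F.cross_entropy(logits, labels)

    return pred_loss + contrast_loss
```

In this simplified form, the prediction term drives local, fine-grained feature learning while the contrastive term aligns global representations across views, which mirrors the complementarity the abstract claims for the two tasks.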