GMML is All you Need
Sara Atito, Muhammed Awais, Srinivasa Nandam, Josef Kittler
Vision transformers (ViTs) have generated significant interest in the computer vision community because of their flexibility in exploiting contextual information, whether it is sharply confined and local or long-range and global. However, they are known to be data hungry and are therefore often pretrained on large-scale datasets, e.g. JFT-300M or ImageNet. An ideal learning method would perform well regardless of dataset size, a property that current learning methods lack, with only a few existing works studying ViTs under limited data. We propose Group Masked Model Learning (GMML), a self-supervised learning (SSL) method that is able to train ViTs and achieve state-of-the-art (SOTA) performance when pretrained with limited data. GMML uses the information conveyed by all concepts in the image. This is achieved by randomly manipulating groups of connected tokens, successively covering different meaningful parts of the image content, and then recovering the hidden information from the visible part of the concept. Unlike most existing SSL approaches, GMML does not require a momentum encoder, nor does it rely on careful implementation details such as large batches and gradient stopping. Pretraining, finetuning, and evaluation code is available at: https://github.com/Sara-Ahmed/GMML
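To make the group-masking idea concrete, below is a minimal PyTorch sketch of masking random rectangular groups of connected patch tokens and computing a reconstruction loss only on the corrupted positions. It is an illustration under simplifying assumptions (pixel-level L1 reconstruction, noise corruption, a 14x14 patch grid); the function names, masking parameters, and the stand-in model output are hypothetical and do not reproduce the authors' implementation, which is in the repository linked above.

import torch

def group_mask(num_patches_per_side: int, mask_ratio: float = 0.5,
               max_block: int = 4) -> torch.Tensor:
    """Mask random rectangular groups of connected patch tokens until the
    requested fraction of the patch grid is covered (illustrative sketch)."""
    n = num_patches_per_side
    mask = torch.zeros(n, n, dtype=torch.bool)
    target = int(mask_ratio * n * n)
    while mask.sum() < target:
        # Sample a small block of connected tokens and mark it as corrupted.
        h = torch.randint(1, max_block + 1, (1,)).item()
        w = torch.randint(1, max_block + 1, (1,)).item()
        top = torch.randint(0, n - h + 1, (1,)).item()
        left = torch.randint(0, n - w + 1, (1,)).item()
        mask[top:top + h, left:left + w] = True
    return mask.flatten()  # (n*n,), True where the token is corrupted

def masked_reconstruction_loss(pixels: torch.Tensor,
                               reconstruction: torch.Tensor,
                               mask: torch.Tensor) -> torch.Tensor:
    """L1 reconstruction loss computed only on the masked patch locations."""
    # pixels / reconstruction: (B, num_patches, patch_dim)
    diff = (reconstruction - pixels).abs().mean(dim=-1)  # (B, num_patches)
    return diff[:, mask].mean()

# Example usage: corrupt the masked groups with noise, then (in a real setup)
# feed the corrupted patches through a ViT encoder plus a light decoder.
B, n, patch_dim = 2, 14, 16 * 16 * 3            # e.g. 224x224 image, 16x16 patches
patches = torch.rand(B, n * n, patch_dim)       # flattened patch pixels
mask = group_mask(n, mask_ratio=0.5)
corrupted = patches.clone()
corrupted[:, mask] = torch.rand_like(corrupted[:, mask])
fake_reconstruction = torch.rand_like(patches)  # stand-in for the model output
loss = masked_reconstruction_loss(patches, fake_reconstruction, mask)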