CXRMIM: MASKED IMAGE MODELING PRE-TRAINING PARADIGM FOR CHEST X-RAY IMAGES ANALYSIS
Zhendong Wang, Haowen Ma, Jianwei Niu
Masked image modeling (MIM), which performs the pretext task of reconstructing images from partial observations without any labels, is an effective approach for Vision Transformers (ViT) to obtain better initializations and representations in natural image analysis. Several works have adopted different masking strategies to make ViT aggregate contextual information and infer the missing content. Nonetheless, chest radiographs differ conspicuously from photographic images, and conducting MIM on chest X-rays remains challenging. Accordingly, this paper proposes cxrMIM, a pre-training recipe and masking strategy specialized for chest radiographs and grounded in their physiological characteristics. In cxrMIM, the out-of-lung region is analyzed first, and the lung region is then reconstructed with the help of mechanical connections and anatomical and physiological similarities. cxrMIM encourages ViT to excavate the commonalities of pulmonary structures and achieves better performance on downstream tasks. We conducted experiments on the ChestX-ray14 dataset, comparing against advanced self-supervised methods (e.g., MoCo v3, MAE). Quantitative and qualitative results show that cxrMIM improves the efficiency of Vision Transformers on the multi-label thorax disease classification problem, and a cxrMIM-pretrained Swin-B performs comparably to state-of-the-art CNN models.
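To make the masking strategy concrete, the following is a minimal sketch (not the authors' released code) of what a lung-region-aware masking step could look like. It assumes a pre-computed binary lung segmentation, a 16-pixel patch grid, and a 75% mask ratio, all chosen here only for illustration: out-of-lung patches stay visible as context, while a random subset of lung patches is hidden for reconstruction.

```python
import numpy as np

def lung_aware_mask(lung_mask, patch_size=16, mask_ratio=0.75, rng=None):
    """Build a per-patch mask that hides lung-region patches only.

    lung_mask : (H, W) binary array, 1 inside the lung field, 0 outside
                (e.g., from an off-the-shelf lung segmentation model).
    Returns a boolean array of length (H//patch_size) * (W//patch_size),
    where True marks a patch to be masked (reconstructed) and False a
    patch left visible to the encoder.
    """
    rng = rng or np.random.default_rng()
    h, w = lung_mask.shape
    gh, gw = h // patch_size, w // patch_size

    # A patch counts as "lung" if any of its pixels fall inside the lung field.
    patch_lung = (
        lung_mask[:gh * patch_size, :gw * patch_size]
        .reshape(gh, patch_size, gw, patch_size)
        .max(axis=(1, 3))
        .astype(bool)
    )

    mask = np.zeros(gh * gw, dtype=bool)
    lung_idx = np.flatnonzero(patch_lung.reshape(-1))

    # Mask a random subset of lung patches; out-of-lung patches remain visible
    # so the encoder can use them as context to infer the hidden lung content.
    n_mask = int(round(mask_ratio * len(lung_idx)))
    mask[rng.choice(lung_idx, size=n_mask, replace=False)] = True
    return mask
```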