Deep Residual Networks With Common Linear Multi-Step and Advanced Numerical Schemes
Zhengbo Luo, Weilian Zhou, Sei-ichiro Kamata, Xuehui Hu
This paper presents an efficient multi-scale vision transformer, called CBPT, that capably serves as a general-purpose backbone for computer vision. A challenging issue in transformer design is that window self-attention (WSA) often limits the information transmission of each token, whereas enlarging WSA's receptive field is very expensive to compute. To address this issue, we develop the Locally-Enhanced Window Self-Attention mechanism, which doubles the receptive field while keeping a computational complexity similar to typical WSA. In addition, we propose Information-Enhanced Patch Merging, which mitigates the information loss incurred when downsampling the attention map. Incorporating these designs together with the Cross Block Partial connection, CBPT not only surpasses Swin by +1 box AP and +1 mask AP on COCO object detection and instance segmentation, but also uses 30% fewer parameters and 35% fewer FLOPs than Swin.
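For context, below is a minimal sketch of the plain window self-attention (WSA) baseline the abstract contrasts against, in the style popularized by Swin: the feature map is partitioned into non-overlapping windows and attention is computed independently inside each one, which is what restricts each token's receptive field. The window size, channel width, and head count are illustrative assumptions; this is not CBPT's Locally-Enhanced variant.

```python
# A hedged sketch of standard window self-attention (WSA), assuming a
# (B, H, W, C) feature map whose spatial dims are divisible by the window size.
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    def __init__(self, dim=96, window_size=7, num_heads=3):
        super().__init__()
        self.window_size = window_size
        # Standard multi-head attention, applied independently per window.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        B, H, W, C = x.shape
        ws = self.window_size
        # Partition into non-overlapping ws x ws windows:
        # (B, H, W, C) -> (num_windows * B, ws*ws, C)
        windows = (
            x.view(B, H // ws, ws, W // ws, ws, C)
             .permute(0, 1, 3, 2, 4, 5)
             .reshape(-1, ws * ws, C)
        )
        # Tokens attend only within their own window -- the limited
        # receptive field the abstract refers to.
        out, _ = self.attn(windows, windows, windows)
        # Reverse the partition back to (B, H, W, C).
        return (
            out.view(B, H // ws, W // ws, ws, ws, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(B, H, W, C)
        )

# Usage: a 56x56 feature map with 96 channels and 7x7 windows.
x = torch.randn(2, 56, 56, 96)
print(WindowSelfAttention()(x).shape)  # torch.Size([2, 56, 56, 96])
```

Because each window holds only ws*ws tokens, attention cost scales linearly with image area rather than quadratically; the trade-off is the confined receptive field that mechanisms like the paper's Locally-Enhanced WSA aim to widen at comparable cost.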