Look and Think: Intrinsic Unification of Self-attention and Convolution for Spatial-Channel Specificity

Xiang Gao (South China University of Technology); Honghui Lin (South China University of Technology); Yu Li (South China University of Technology); Ruiyan Fang (South China University of Technology); Xin Zhang (South China University of Technology)

SPS

06 Jun 2023

Convolution and self-attention are popular paradigms, and many works treat them as two separate components when exploring their potential combination. In this work, we consider their intrinsic properties in the spatial and channel domains for vision representation. Convolution has the strength of channel-specificity, which lets it "think" to refine diverse features, but it has the weakness of being spatially independent when perceiving different regions. Self-attention has the strength of spatial-specificity, which lets it "look" to perceive various regions, but it has the weakness of being channel-insensitive when summarizing different features. Building on this insight, we combine the spatial-specificity of self-attention with the channel-specificity of convolution so that each compensates for the other's weakness. We propose a unified module, termed the SCS module, that achieves the combined advantage of Spatial-Channel Specificity. Specifically, the SCS module calculates dynamic attention weights with a self-attention mechanism and then takes a weighted sum of the input features, similar to convolution. Extensive experiments show that the SCS module improves both CNN and Transformer models on image classification and downstream tasks. Visualizations further illustrate the ability of the SCS module for vision representation.
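The two-step description above (dynamic attention weights, then a convolution-like weighted sum) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the local k x k window, the projection matrices `w_q` and `w_k`, the per-channel `channel_kernel`, and the function names `extract_patches` and `scs_sketch` are all assumptions made for illustration. The sketch only shows how a weight can vary both per position (from attention) and per channel (from a depthwise-style kernel).

```python
import numpy as np

def extract_patches(x, k):
    """Gather the k*k neighbours of every pixel: (H, W, C) -> (H, W, k*k, C)."""
    h, w, _ = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))  # zero padding at the border
    patches = [xp[di:di + h, dj:dj + w, :] for di in range(k) for dj in range(k)]
    return np.stack(patches, axis=2)

def scs_sketch(x, w_q, w_k, channel_kernel, k=3):
    """Hypothetical spatial-channel-specific aggregation (illustrative only).

    x:              (H, W, C) input feature map
    w_q, w_k:       (C, D) assumed query/key projections
    channel_kernel: (k*k, C) assumed static per-channel weights
    """
    neigh = extract_patches(x, k)                    # (H, W, k*k, C)
    q = x @ w_q                                      # (H, W, D)
    keys = neigh @ w_k                               # (H, W, k*k, D)
    # Dynamic, position-dependent attention over each local window.
    logits = np.einsum("hwd,hwnd->hwn", q, keys) / np.sqrt(w_q.shape[1])
    logits -= logits.max(axis=-1, keepdims=True)     # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)         # (H, W, k*k)
    # Combined weight differs per position (attn) and per channel (kernel).
    weights = attn[..., None] * channel_kernel[None, None]  # (H, W, k*k, C)
    # Convolution-like weighted sum of the input features.
    return (weights * neigh).sum(axis=2), attn

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
out, attn = scs_sketch(
    x,
    w_q=rng.standard_normal((4, 16)),
    w_k=rng.standard_normal((4, 16)),
    channel_kernel=rng.standard_normal((9, 4)),
)
print(out.shape)  # (8, 8, 4)
```

In this sketch, setting `channel_kernel` to all ones leaves only the attention weights, which are shared across channels, while replacing `attn` with a constant leaves only a depthwise convolution, whose weights are shared across positions. The product of the two is one simple way to obtain weights that are specific in both domains; the paper's actual formulation may differ.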

Tags:

Image and video coding