Towards Robust Visual Transformer Networks via K-Sparse Attention
Sajjad Amini, Shahrokh Ghaemmaghami
Transformer networks, originally developed in the machine translation community to eliminate the sequential nature of recurrent neural networks, have shown impressive results in other natural language processing and machine vision tasks. Self-attention is the core module behind visual transformers, globally mixing image information. This module drastically reduces the intrinsic inductive bias imposed by CNNs, such as locality, but suffers from insufficient robustness against some adversarial attacks. In this paper, we introduce K-sparse attention to preserve low inductive bias while robustifying transformers against adversarial attacks. We show that standard transformers attend to values with a dense set of weights, whereas sparse attention, automatically selected by an optimization algorithm, can preserve the generalization performance of the transformer and, at the same time, improve its robustness.
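To make the dense-versus-sparse distinction concrete, the following is a minimal PyTorch sketch of a hard top-K variant of sparse attention, in which each query keeps only its K largest score entries before the softmax. This is an illustration under simplifying assumptions: the fixed hyperparameter K, the function name `k_sparse_attention`, and the hard top-K selection rule are ours for exposition, whereas the paper selects the sparse support automatically via an optimization algorithm rather than a fixed top-K cutoff.

```python
import torch
import torch.nn.functional as F

def k_sparse_attention(q, k, v, K=8):
    """Illustrative top-K sparse attention (hypothetical helper, not the
    paper's optimization-based selection).

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    d = q.size(-1)
    # Scaled dot-product scores; dense attention would softmax these directly.
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, H, N, N)

    # Keep only the K largest scores per query; push the rest to -inf so
    # they receive zero weight after the softmax, yielding a sparse row.
    topk_vals, _ = scores.topk(K, dim=-1)
    threshold = topk_vals[..., -1, None]                      # K-th largest score
    sparse_scores = scores.masked_fill(scores < threshold, float("-inf"))

    weights = F.softmax(sparse_scores, dim=-1)                # sparse attention weights
    return torch.matmul(weights, v)

# Usage with ViT-style shapes: 2 images, 4 heads, 197 tokens, 64-dim heads.
q = torch.randn(2, 4, 197, 64)
k = torch.randn(2, 4, 197, 64)
v = torch.randn(2, 4, 197, 64)
out = k_sparse_attention(q, k, v, K=8)
print(out.shape)  # torch.Size([2, 4, 197, 64])
```

In this sketch each output token mixes information from at most K other tokens instead of all of them, which is the behavior the abstract contrasts with the dense attention weights of a standard transformer.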