Finding Optimal Numerical Format for Sub-8-bit Post-Training Quantization of Vision Transformers
Janghwan Lee (Hanyang University); Youngdeok Hwang (Baruch College - The City University of New York (CUNY)); Jungwook Choi (Hanyang University)
Vision Transformers (ViTs) have gained significant attention for their exceptional accuracy on computer vision applications, but their demanding memory requirements and computational complexity have hindered their deployment in practice. Post-training quantization (PTQ) is a practical way to tackle this challenge by directly reducing the bit precision of a ViT. However, the diverse data characteristics across the different operations of a ViT cannot be well captured by a single numerical format (fixed- or floating-point). This work proposes an analytical framework that selects the numerical format of each matrix multiplication in a ViT, enabling mixed-format sub-8-bit quantization. Extensive evaluation demonstrates that the proposed method reduces PTQ error and achieves state-of-the-art accuracy on popular ViT models.
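To illustrate the core idea of mixed-format quantization, the sketch below (not the paper's analytical framework, and all function names and bit-allocation choices are hypothetical) compares a simple fixed-point quantizer against a toy low-bit floating-point quantizer on a tensor and keeps whichever format yields the smaller quantization error, as one might do per matrix-multiplication operand.

```python
# Hypothetical sketch: per-tensor numerical-format selection for sub-8-bit PTQ.
# This is NOT the authors' method; it only illustrates choosing between a
# fixed-point and a low-bit floating-point format by quantization error.
import numpy as np


def quantize_fixed(x: np.ndarray, bits: int = 6) -> np.ndarray:
    """Symmetric uniform (fixed-point/integer) quantization."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def quantize_float(x: np.ndarray, exp_bits: int = 3, man_bits: int = 2) -> np.ndarray:
    """Toy low-bit floating-point quantization (sign + exponent + mantissa)."""
    sign = np.sign(x)
    mag = np.abs(x)
    max_mag = np.max(mag)
    # Rescale the tensor's maximum magnitude into the format's rough range.
    scale = max_mag / (2.0 ** (2 ** (exp_bits - 1))) if max_mag > 0 else 1.0
    mag = mag / scale
    # Round each value's mantissa to `man_bits` fractional bits at its own scale.
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    step = 2.0 ** (exp - man_bits)
    q = np.round(mag / step) * step
    return sign * q * scale


def pick_format(x: np.ndarray) -> str:
    """Choose the format with the smaller mean-squared quantization error."""
    err_fixed = np.mean((x - quantize_fixed(x)) ** 2)
    err_float = np.mean((x - quantize_float(x)) ** 2)
    return "fixed" if err_fixed <= err_float else "float"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Heavy-tailed tensors (e.g., some ViT activations) often favor float formats,
    # while near-Gaussian tensors (e.g., weights) often favor fixed-point.
    heavy_tailed = rng.standard_t(df=2, size=(128, 64))
    gaussian = rng.standard_normal((128, 64))
    print("heavy-tailed tensor ->", pick_format(heavy_tailed))
    print("gaussian tensor     ->", pick_format(gaussian))
```

In practice, the choice would be made per matrix multiplication (e.g., query/key/value projections, attention scores, MLP layers) rather than per tensor in isolation, which is the mixed-format assignment the abstract refers to.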