
Finding Optimal Numerical Format for Sub-8-bit Post-Training Quantization of Vision Transformers

Janghwan Lee (Hanyang University); Youngdeok Hwang (Baruch College - The City University of New York (CUNY)); Jungwook Choi (Hanyang University)

07 Jun 2023

Vision Transformers (ViTs) have gained significant attention for their exceptional accuracy on computer vision tasks, but their demanding memory requirements and computational complexity have hindered deployment. Post-training quantization (PTQ) is a practical way to tackle this challenge by directly reducing the bit-precision of ViTs. However, the diverse data characteristics across the different operations of a ViT cannot be captured well by a single numerical format (fixed- or floating-point). This work proposes an analytical framework that selects the optimal numerical format for each matrix multiplication in a ViT, enabling mixed-format sub-8-bit quantization. Extensive evaluation demonstrates that the proposed method reduces PTQ error and achieves state-of-the-art accuracy on popular ViT models.
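
As a rough illustration of the mixed-format idea (not the authors' actual framework), the sketch below compares per-tensor quantization error under a symmetric fixed-point (integer) format and a toy low-bit floating-point format, then picks whichever format yields the lower error for each tensor. It assumes NumPy; the function names quantize_fixed, quantize_float, and pick_format are hypothetical and chosen only for this example.

    import numpy as np

    def quantize_fixed(x, bits=6):
        """Symmetric fixed-point (integer) quantization with a per-tensor scale."""
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(x)) / qmax
        return np.round(x / scale).clip(-qmax, qmax) * scale

    def quantize_float(x, exp_bits=3, man_bits=2):
        """Toy low-bit floating-point quantization: round each value's mantissa."""
        # Decompose x = mantissa * 2**exponent and keep a few mantissa bits.
        mant, exp = np.frexp(x)
        mant = np.round(mant * 2 ** (man_bits + 1)) / 2 ** (man_bits + 1)
        # Clamp the exponent to the range representable with `exp_bits`.
        exp = np.clip(exp, -(2 ** (exp_bits - 1)), 2 ** (exp_bits - 1) - 1)
        return np.ldexp(mant, exp)

    def pick_format(x):
        """Choose the format with the lower mean-squared quantization error."""
        err_fixed = np.mean((x - quantize_fixed(x)) ** 2)
        err_float = np.mean((x - quantize_float(x)) ** 2)
        return ("fixed", err_fixed) if err_fixed <= err_float else ("float", err_float)

    # Example: a narrow Gaussian-like tensor vs. a heavy-tailed tensor.
    rng = np.random.default_rng(0)
    print(pick_format(rng.normal(size=10_000)))
    print(pick_format(rng.standard_t(df=2, size=10_000)))

The intuition this sketch captures is that fixed-point formats suit narrowly distributed tensors, while floating-point formats tolerate heavy-tailed distributions better, so an error-driven per-operation choice can outperform a single format; the paper's analytical framework formalizes this selection for sub-8-bit PTQ of ViTs.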
