A Probabilistic Framework for Pruning Transformers via a Finite Admixture of Keys

Tan Minh Nguyen (University of California, Los Angeles); Tam Minh Nguyen (FPT Software); Long Minh Bui (FPT Software); Hai Do (FPT Software); Duy Khuong Nguyen (FPT Software Ltd. - FPT Corporation); Dung D. D. Le (College of Engineering and Computer Science, VinUniversity); Hung Tran-The (Deakin University); Nhat Ho (University of Texas at Austin); Stanley Osher (UCLA); Richard Baraniuk (Rice University)


07 Jun 2023

Pairwise dot-product self-attention is key to the success of transformers, which achieve state-of-the-art performance across a variety of applications in language and vision, but it is costly to compute. It has been shown that most attention scores and keys in transformers are redundant and can be removed without loss of accuracy. In this paper, we develop a novel probabilistic framework for pruning attention scores and keys in transformers. We first formulate an admixture model of attention keys in which the data to be clustered are the attention queries. We show that the attention scores in self-attention correspond to the posterior distribution of this model when the attention keys admit a uniform prior distribution. We then relax this uniform-prior constraint and let the model learn the priors from data, resulting in a new Finite Admixture of Keys (FiAK). The learned priors are used to prune redundant attention scores and keys in the baseline transformers, which improves the diversity of the attention patterns that the models capture. We corroborate the efficiency of transformers pruned with FiAK on the ImageNet object classification and WikiText-103 language modeling tasks. Our experiments demonstrate that transformers pruned with FiAK yield accuracy similar to or better than that of the baseline dense transformers while being much more efficient in terms of memory and computational cost.
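The correspondence between attention scores and a mixture posterior can be checked numerically. The sketch below is an illustration only, not the authors' implementation: it assumes a plain Gaussian mixture with isotropic components centred on the keys, keys of equal norm, and a variance of sqrt(D) for key dimension D. Under those assumptions the posterior with a uniform prior equals scaled dot-product softmax attention. The function names, the "learned" prior values, and the top-k pruning rule are all hypothetical; the abstract does not specify how the learned priors are thresholded.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def key_posterior(queries, keys, prior, sigma2):
    """Posterior p(key j | query i) for isotropic Gaussian components centred on the keys."""
    # Squared Euclidean distance between every query and every key: shape (N, M).
    sq_dist = ((queries[:, None, :] - keys[None, :, :]) ** 2).sum(axis=-1)
    # log prior + Gaussian log-likelihood (up to a constant), normalised over keys.
    return softmax(np.log(prior)[None, :] - sq_dist / (2.0 * sigma2))

def prune_by_prior(prior, num_keep):
    """Hypothetical pruning rule: keep the num_keep keys with the largest prior."""
    return np.sort(np.argsort(prior)[-num_keep:])

rng = np.random.default_rng(0)
num_queries, num_keys, dim = 4, 6, 8
queries = rng.normal(size=(num_queries, dim))
keys = rng.normal(size=(num_keys, dim))
keys /= np.linalg.norm(keys, axis=-1, keepdims=True)  # equal-norm keys (assumption)

sigma2 = np.sqrt(dim)  # variance chosen so the posterior matches the 1/sqrt(D) scaling
uniform = np.full(num_keys, 1.0 / num_keys)

# Standard scaled dot-product attention scores versus the uniform-prior posterior.
attention = softmax(queries @ keys.T / np.sqrt(dim))
posterior = key_posterior(queries, keys, uniform, sigma2)
print(np.allclose(attention, posterior))

# Made-up non-uniform priors standing in for learned ones; low-prior keys are dropped.
learned = np.array([0.40, 0.30, 0.20, 0.05, 0.03, 0.02])
kept = prune_by_prior(learned, num_keep=3)
renormalised = learned[kept] / learned[kept].sum()
pruned_scores = key_posterior(queries, keys[kept], renormalised, sigma2)
```

The equality holds because the squared query norm cancels when normalising over keys, and the squared key norm cancels when all keys share the same norm, leaving only the dot-product term. With a non-uniform prior, each score is reweighted by the prior of its key, so keys with negligible prior contribute little and are natural candidates for removal.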

Tags:

Pattern recognition and classification

Conference: IEEE ICASSP 2023, 4-10 June 2023, Greece (virtual and in-person)
