
How Much Self-Attention Do We Need? Trading Attention For Feed-Forward Layers

Kazuki Irie, Alexander Gerstenberger, Ralf Schlüter, Hermann Ney

04 May 2020

We propose simple architectural modifications to the standard Transformer with the goal of reducing its total state size (defined as the number of self-attention layers, times the sum of the key and value dimensions, times the number of positions) without loss of performance. Large-scale Transformer language models have empirically been shown to give very good performance. However, scaling up results in a model that needs to store large states at evaluation time, which can increase the memory requirement dramatically for search, e.g., in speech recognition (first-pass decoding, lattice rescoring, or shallow fusion). To increase the model capacity efficiently without increasing the state size, we replace the single-layer feed-forward module in the Transformer layer with a deeper network and decrease the total number of layers. In addition, we evaluate the effect of key-value tying, which directly halves the state size. On TED-LIUM 2, we obtain a model whose state size is four times smaller than that of the standard Transformer, with only a 2% relative loss in perplexity, which makes the deployment of Transformer language models more convenient.
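The state-size accounting in the abstract can be made concrete with a small sketch. The dimensions below are illustrative placeholders, not numbers from the paper; the point is only how halving the layer count (compensated by deeper feed-forward modules) combines with key-value tying (keys and values share one vector, so the value contribution drops out) to shrink the stored state by a factor of four.

```python
# State size as defined in the abstract:
# (number of self-attention layers) x (key dim + value dim) x (positions).
def state_size(num_layers, key_dim, value_dim, positions):
    return num_layers * (key_dim + value_dim) * positions

# Baseline Transformer LM (illustrative dimensions, not from the paper).
baseline = state_size(num_layers=24, key_dim=512, value_dim=512, positions=128)

# Modified model: half the layers (each with a deeper feed-forward module)
# plus key-value tying, so no separate value vectors are stored.
modified = state_size(num_layers=12, key_dim=512, value_dim=0, positions=128)

print(baseline // modified)  # prints 4: a 4x smaller state, as in the abstract
```

Each factor contributes independently: key-value tying alone halves the state, and the reduced layer count halves it again, regardless of the particular dimensions chosen.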
