CONFORMER-BASED SPEECH RECOGNITION WITH LINEAR NYSTRÖM ATTENTION AND ROTARY POSITION EMBEDDING
Lahiru Samarakoon, Tsun-Yat Leung
Self-attention has become an important component of end-to-end (E2E) automatic speech recognition (ASR). Recently, the Convolution-augmented Transformer (Conformer) with relative positional encoding (RPE) achieved state-of-the-art performance. However, the computational and memory complexity of self-attention grows quadratically with the input sequence length, which can be significant for the Conformer encoder when processing longer sequences. In this work, we propose to replace self-attention with linear-complexity Nyström attention, a low-rank approximation of the attention scores based on the Nyström method. In addition, we propose to use Rotary Position Embedding (RoPE) with Nyström attention, since RPE has quadratic complexity. Moreover, we show that models can be made even lighter by removing self-attention sub-layers from the top encoder layers without any drop in performance. Furthermore, we demonstrate that the convolutional sub-layers in the Conformer can effectively recover the information lost due to the Nyström approximation.
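The two techniques the abstract combines are easy to see in a few lines. The sketch below is not the authors' implementation; it is a minimal NumPy illustration of Nyström attention, which approximates the n × n attention score matrix through m landmark points (segment means of the queries and keys, following the standard Nyströmformer formulation), with RoPE applied to queries and keys so that position enters the scores without any quadratic relative-position term. Function names and the landmark count are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rotary_embed(x, base=10000.0):
    """Rotary Position Embedding (half-split variant) for a (n, d) array."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # (d/2,) rotation frequencies
    angles = np.arange(n)[:, None] * freqs[None, :]  # (n, d/2) position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) feature pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def nystrom_attention(Q, K, V, num_landmarks=32):
    """Nyström low-rank approximation of softmax attention: O(n*m), not O(n^2)."""
    n, d = Q.shape
    m = num_landmarks
    # RoPE encodes position directly in Q and K, so no quadratic
    # relative-position bias matrix is needed.
    Q, K = rotary_embed(Q), rotary_embed(K)
    # Segment-mean landmarks (assumes n divisible by m for brevity).
    Q_l = Q.reshape(m, n // m, d).mean(axis=1)
    K_l = K.reshape(m, n // m, d).mean(axis=1)
    s = 1.0 / np.sqrt(d)
    F = softmax(Q @ K_l.T * s)    # (n, m): queries vs. landmark keys
    A = softmax(Q_l @ K_l.T * s)  # (m, m): small landmark-landmark kernel
    B = softmax(Q_l @ K.T * s)    # (m, n): landmark queries vs. keys
    # Full scores are approximated as F @ pinv(A) @ B; only the m x m
    # kernel is ever (pseudo-)inverted.
    return F @ np.linalg.pinv(A) @ (B @ V)
```

Note that every matrix product above is at most n × m, which is the source of the linear scaling in sequence length; the approximation error this introduces is what the paper's convolutional sub-layers are shown to recover.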