Self-Convolution for Automatic Speech Recognition
Tian-Hao Zhang (University of Science and Technology Beijing); Qi Liu (University of Science and Technology Beijing); Xinyuan Qian (University of Science and Technology Beijing); Song-Lu Chen (University of Science and Technology Beijing); Feng Chen (EEasy Technology Co., Ltd.); Xu-Cheng Yin (University of Science and Technology Beijing)
Self-attention plays a significant role in recent automatic speech recognition (ASR) models with promising results. However, it suffers from high computational complexity and a weak capability to model local information. In contrast, the convolutional neural network (CNN) is computationally efficient and excels at learning local information, but it lacks self-interaction and fails to capture long-range dependencies among input tokens. Accordingly, we exploit their complementary advantages and propose a new module, namely self-convolution, to compensate for their individual limitations. Specifically, self-convolution generates a convolution kernel at each token (to model local information), which is then used to convolve the token itself (for self-interaction). Moreover, we incorporate global information into the kernel generation to enhance the learning of long-range dependencies. In this way, the advantages of self-attention and CNN are both utilized. We conduct rigorous experiments on the LibriSpeech, TED-LIUM 2, and AISHELL-1 datasets and demonstrate that our proposed self-convolution achieves superior ASR performance to self-attention at a lower computational cost.
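
For illustration, the following minimal PyTorch sketch shows one way the mechanism described in the abstract could be realized: each token generates its own depthwise convolution kernel, conditioned on the token itself together with a mean-pooled global context vector, and that kernel is then applied to the token's local window. The module name SelfConvolution, the kernel_size parameter, and the choice of mean pooling for the global context are assumptions made for this sketch, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfConvolution(nn.Module):
    """Illustrative sketch of per-token dynamic convolution with global context
    (an assumption for exposition, not the authors' exact architecture)."""

    def __init__(self, d_model: int, kernel_size: int = 15):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel size keeps the output aligned"
        self.kernel_size = kernel_size
        # Generates one kernel per token from the token concatenated with
        # a global summary vector, so kernels reflect long-range information.
        self.kernel_gen = nn.Linear(2 * d_model, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        B, T, D = x.shape
        K = self.kernel_size

        # Global information: mean over time, broadcast to every token
        # (a simple stand-in for whatever global summary the paper uses).
        global_ctx = x.mean(dim=1, keepdim=True).expand(-1, T, -1)

        # Per-token kernels, normalized over the kernel axis: (B, T, K)
        kernels = F.softmax(
            self.kernel_gen(torch.cat([x, global_ctx], dim=-1)), dim=-1
        )

        # Extract a local window of size K around each token: (B, T, D, K)
        pad = (K - 1) // 2
        x_pad = F.pad(x, (0, 0, pad, pad))
        windows = x_pad.unfold(dimension=1, size=K, step=1)

        # Convolve each token's window with that token's own kernel,
        # sharing the kernel across feature channels (depthwise style).
        return torch.einsum('btdk,btk->btd', windows, kernels)

# Usage: shapes are preserved, so the module drops into an encoder block.
layer = SelfConvolution(d_model=256, kernel_size=15)
y = layer(torch.randn(4, 100, 256))  # -> (4, 100, 256)

Compared with self-attention, a construction like this touches only a fixed-size window per token, so its cost grows linearly with sequence length rather than quadratically, while the global context injected into kernel generation is one plausible way to retain long-range information.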