Self-Convolution for Automatic Speech Recognition

Tian-Hao Zhang (University of Science and Technology Beijing); Qi Liu (University of Science and Technology Beijing); Xinyuan Qian (USTB); Song-Lu Chen (University of Science and Technology); Feng Chen (EEasy Technology Co. LTD); Xu-Cheng Yin (University of Science and Technology Beijing)

07 Jun 2023

Self-attention plays a significant role in recent automatic speech recognition (ASR) models and has delivered promising results. However, it suffers from high computational complexity and is weak at modeling local information. In contrast, the convolutional neural network (CNN) is computationally efficient and better at learning local information, but it lacks self-interaction and fails to capture long-range dependencies among input tokens. Accordingly, we combine their complementary advantages and propose a new module, namely self-convolution, in which each compensates for the other's limitations. Specifically, self-convolution generates a convolution kernel at each token (to model local information), which is then used to convolve the input itself (for self-interaction). Moreover, we bring in global information during the generation of the convolution kernels to enhance the learning of long-range dependencies. In this way, the advantages of both self-attention and CNN are utilized. We conduct rigorous experiments on the LibriSpeech, Tedlium2, and AIShell1 datasets and demonstrate that our proposed self-convolution achieves better ASR performance than self-attention at lower computational cost.
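
The abstract does not give the exact formulation, so the following is only a minimal illustrative sketch of the general idea, not the authors' module. It assumes a hypothetical projection matrix `w_kernel` that maps each token to a kernel of width K, a softmax over the kernel taps, and mean pooling over time as the source of global information; all of these choices and names are assumptions made for illustration.

```python
import numpy as np


def self_convolution(x, w_kernel, use_global=True):
    """Illustrative per-token dynamic convolution (assumed design, not the paper's).

    x:        (T, D) array of token features.
    w_kernel: (D, K) hypothetical projection producing K kernel taps per token, K odd.
    """
    T, D = x.shape
    K = w_kernel.shape[1]
    assert K % 2 == 1, "kernel width must be odd for symmetric padding"

    # Assumed source of global information: the mean over time, added to each token
    # before the kernel is generated.
    ctx = x.mean(axis=0, keepdims=True) if use_global else 0.0

    # Each token generates its own kernel; softmax normalization is an assumption.
    logits = (x + ctx) @ w_kernel                      # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)
    kernels = np.exp(logits)
    kernels /= kernels.sum(axis=1, keepdims=True)

    # Gather the local window around every token (zero padding at the edges).
    pad = K // 2
    x_pad = np.pad(x, ((pad, pad), (0, 0)))
    windows = np.stack([x_pad[k:k + T] for k in range(K)], axis=1)  # (T, K, D)

    # The kernel generated from the input is applied back to the same input.
    return np.einsum("tk,tkd->td", kernels, windows)  # (T, D)


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
w_kernel = rng.standard_normal((4, 3))
y = self_convolution(x, w_kernel)
print(y.shape)
```

In this sketch the work per token grows with the kernel width K rather than with the sequence length T, so the total cost scales as T·K·D instead of the T²·D of full self-attention, which is consistent with the lower computational cost the abstract reports.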

Tags:

Acoustic modeling for automatic speech recognition

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
