Alignment-Learning based single-step decoding for accurate and fast non-autoregressive speech recognition

Yonghe Wang, Rui Liu, Feilong Bao, Hui Zhang, Guanglai Gao

Length: 00:14:04

13 May 2022

Non-autoregressive transformer (NAT) based speech recognition models have attracted growing attention because they offer faster inference than their autoregressive counterparts, especially when single-step decoding is applied. However, single-step decoding with length prediction suffers from decoding instability and yields only limited gains in inference speed. To address this, we propose an alignment-learning based NAT model, named AL-NAT. Our idea is inspired by the fact that the encoder CTC output and the target sequence are monotonically related. Specifically, we design an alignment cost matrix between the CTC output tokens and the target tokens, and define a novel alignment loss that minimizes the distance between the alignment cost matrix and the ground-truth monotonic alignment path. By eliminating the length prediction mechanism, AL-NAT achieves remarkable improvements in recognition accuracy and decoding speed. To learn contextual knowledge that improves decoding accuracy, we further add a lightweight language model on both the encoder and decoder sides. Our proposed method achieves WERs of 2.8%/6.3% and an RTF of 0.011 on the LibriSpeech test-clean/test-other sets with a lightweight 3-gram LM, and a CER of 5.3% and an RTF of 0.005 on AISHELL-1 without an LM.
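The monotonic relation between the frame-level CTC output and the target sequence, which the abstract cites as the motivation for AL-NAT, can be illustrated with the standard greedy CTC collapse. This is a minimal sketch of that generic step only, not the alignment cost matrix or alignment loss proposed in the talk; the function name `ctc_collapse`, the blank index 0, and the toy frame labels are assumptions chosen for illustration.

```python
from typing import List, Tuple


def ctc_collapse(frame_ids: List[int], blank: int = 0) -> Tuple[List[int], List[int]]:
    """Merge repeated frame labels and drop blanks (standard greedy CTC collapse).

    Returns the collapsed token sequence and, for each token, the index of the
    frame where it first appears. Names and the blank index are illustrative.
    """
    tokens: List[int] = []
    starts: List[int] = []
    prev = blank
    for t, label in enumerate(frame_ids):
        # A token is emitted only when a non-blank label differs from the previous frame.
        if label != blank and label != prev:
            tokens.append(label)
            starts.append(t)
        prev = label
    return tokens, starts


# Toy per-frame argmax labels (hypothetical data; 0 is the blank symbol).
frames = [0, 3, 3, 0, 0, 5, 5, 5, 0, 3, 0]
tokens, starts = ctc_collapse(frames)

# Each output token maps to a strictly later frame than the one before it,
# which is the monotonic property the abstract refers to.
is_monotonic = all(a < b for a, b in zip(starts, starts[1:]))
print(tokens, starts, is_monotonic)
```

Because the collapse runs in a single pass over the frames and never reorders tokens, the start frames always increase, so any alignment between CTC output tokens and target tokens follows a monotonic path.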

Tags:

speech recognition

non-autoregressive transformer

alignment learning

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
