Alignment-Learning based single-step decoding for accurate and fast non-autoregressive speech recognition
Yonghe Wang, Rui Liu, Feilong Bao, Hui Zhang, Guanglai Gao
Non-autoregressive transformer (NAT) based speech recognition models have gained increasing attention because they achieve faster inference than their autoregressive counterparts, especially when single-step decoding is applied. However, single-step decoding with length prediction suffers from decoding instability and yields only limited gains in inference speed. To address this, we propose an alignment-learning based NAT model, named AL-NAT. Our idea is inspired by the fact that the encoder CTC output and the target sequence are monotonically related. Specifically, we design an alignment cost matrix between the CTC output tokens and the target tokens and define a novel alignment loss that minimizes the distance between the alignment cost matrix and the ground-truth monotonic alignment path. By eliminating the length prediction mechanism, our AL-NAT model achieves remarkable improvements in recognition accuracy and decoding speed. To learn contextual knowledge and further improve decoding accuracy, we add a lightweight language model on both the encoder and decoder sides. Our proposed method achieves WERs of 2.8%/6.3% and an RTF of 0.011 on the LibriSpeech test-clean/other sets with a lightweight 3-gram LM, and a CER of 5.3% and an RTF of 0.005 on AISHELL-1 without an LM.
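To make the alignment-loss idea concrete, the following is a minimal PyTorch sketch written under my own assumptions, not the paper's actual formulation: the alignment cost matrix is taken as the pairwise negative log-probability of each target token under each CTC output frame, and the "ground-truth monotonic alignment path" is approximated by a uniform diagonal frame-to-token mapping. The function and variable names (alignment_loss, frame_log_probs, targets) are illustrative only.

```python
import torch
import torch.nn.functional as F


def alignment_loss(frame_log_probs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """frame_log_probs: (T, V) log-softmax over the vocabulary for T CTC output frames.
    targets: (U,) target token ids.
    Returns a scalar that is small when the frames align (roughly diagonally,
    i.e. monotonically) with the target sequence. This is a sketch of the idea,
    not the loss defined in the paper."""
    T, _ = frame_log_probs.shape
    U = targets.shape[0]

    # Alignment cost matrix C[t, u]: cost of emitting target token u at frame t.
    cost = -frame_log_probs[:, targets]                      # (T, U)

    # Assumed ground-truth monotonic path: frame t maps to target floor(t * U / T),
    # i.e. a uniform diagonal alignment (an assumption for illustration).
    path = torch.zeros(T, U)
    frame_to_target = torch.clamp((torch.arange(T) * U) // T, max=U - 1)
    path[torch.arange(T), frame_to_target] = 1.0

    # Loss: distance between a soft alignment derived from the cost matrix
    # and the monotonic path.
    soft_align = F.softmax(-cost, dim=-1)                    # (T, U), rows sum to 1
    return F.mse_loss(soft_align, path)


if __name__ == "__main__":
    T, U, V = 40, 10, 100
    logits = torch.randn(T, V, requires_grad=True)
    targets = torch.randint(0, V, (U,))
    loss = alignment_loss(F.log_softmax(logits, dim=-1), targets)
    loss.backward()  # gradients flow to the encoder logits
    print(float(loss))
```

Because the alignment target is fixed per utterance, a loss of this shape can supervise the decoder in a single pass without a separate length-prediction head, which is the property the abstract attributes to AL-NAT.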