NON-AUTOREGRESSIVE TRANSFORMER WITH UNIFIED BIDIRECTIONAL DECODER FOR AUTOMATIC SPEECH RECOGNITION
Chuan-Fei Zhang, Yan Liu, Tian-Hao Zhang, Song-Lu Chen, Xu-Cheng Yin, Feng Chen
Non-autoregressive (NAR) transformer models have been studied intensively in automatic speech recognition (ASR), and a common approach in many NAR transformer models is to use the causal mask to limit token dependencies. However, the causal mask is designed for the left-to-right decoding process of the non-parallel autoregressive (AR) transformer, which is inappropriate for the parallel NAR transformer since it ignores the right-to-left contexts. Some methods utilize right-to-left contexts with an extra decoder, but these methods increase the model complexity. To tackle the above problems, we propose a new non-autoregressive transformer with a unified bidirectional decoder (NAT-UBD), which can simultaneously utilize left-to-right and right-to-left contexts for ASR. However, direct use of bidirectional contexts causes information leakage, meaning the decoder output at a given position can be affected by the character information of the input at the same position. To avoid information leakage, we propose a novel attention mask and modify the vanilla query, key, and value matrices for NAT-UBD. Experimental results verify that NAT-UBD can achieve character error rates (CERs) of 5.0%/5.5% on the Aishell-1 dev/test sets, outperforming all previous NAR transformer models. Moreover, NAT-UBD can run 49.8x faster than the AR transformer baseline when decoding in a single step.
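For illustration only: the abstract does not give the exact construction of the proposed attention mask, but a minimal sketch of the core idea, allowing both left and right context while preventing each output position from attending to the input token at the same position, might look like the following (the function name, the use of NumPy, and the boolean mask convention are assumptions, not the paper's formulation).

```python
import numpy as np

def bidirectional_no_self_mask(seq_len: int) -> np.ndarray:
    """Hypothetical anti-leakage mask: True = attention allowed, False = blocked.

    Every position may attend to tokens on its left and right (bidirectional
    context), but not to the input token at its own position, which is one
    plausible way to avoid same-position information leakage.
    """
    mask = np.ones((seq_len, seq_len), dtype=bool)
    np.fill_diagonal(mask, False)  # block the same-position (diagonal) entries
    return mask

# Contrast with the causal (left-to-right) mask used by AR decoders,
# which blocks all right-to-left context.
causal = np.tril(np.ones((4, 4), dtype=bool))   # lower-triangular: left context only
bidirectional = bidirectional_no_self_mask(4)   # full matrix minus the diagonal
print(causal)
print(bidirectional)
```

In this sketch the only difference from a standard (unmasked) encoder-style attention pattern is the blocked diagonal; how NAT-UBD combines such a mask with its modified query, key, and value matrices is detailed in the paper itself.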