Multi-modal ASR error correction with joint ASR error detection
Binghuai Lin (MIG, Tencent Science and Technology Ltd.); Liyuan Wang (Tencent Technology Co., Ltd.)
To tackle recognition errors in automatic speech recognition (ASR), a common approach is to apply text-based ASR error correction methods that focus on textual error patterns. To incorporate audio information for better error correction, we propose a sequence-to-sequence multi-modal ASR error correction model. Multi-modal representations from pre-trained audio and text encoders are fused and aligned with an attention mechanism, and the decoder generates correction results from the fused representations. To better exploit the correlations between the two modalities, an additional ASR error detection task is applied on top of the fused representations. We optimize the network with a multi-task learning method that combines the ASR error detection and correction tasks. Experimental results on a 200-hour dataset recorded by Chinese English-as-a-second-language (ESL) learners show that the proposed correction model achieves significant improvements over baselines with and without other ASR error correction methods.
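The sketch below illustrates, in broad strokes, the kind of architecture the abstract describes: attention-based fusion of audio and text encoder outputs, a token-level error-detection head on the fused states, and a decoder trained jointly with the correction objective. It is not the authors' implementation; all module names, dimensions, and the loss weight `lambda_det` are illustrative assumptions.

```python
# Minimal sketch of multi-modal fusion with joint error detection and correction.
# Assumptions: PyTorch, pre-computed encoder outputs, teacher-forced decoding.
import torch
import torch.nn as nn


class MultiModalCorrector(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, num_decoder_layers=2):
        super().__init__()
        # Cross-attention: text (ASR hypothesis) states attend to audio states.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Token-level error-detection head (correct vs. erroneous) on fused states.
        self.detect_head = nn.Linear(d_model, 2)
        # Transformer decoder that generates the corrected transcript.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_decoder_layers)
        self.tgt_embed = nn.Embedding(vocab_size, d_model)
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, text_enc, audio_enc, tgt_ids):
        # Fuse modalities: text states as queries, audio states as keys/values.
        fused, _ = self.cross_attn(text_enc, audio_enc, audio_enc)
        detect_logits = self.detect_head(fused)            # (B, T_text, 2)
        tgt = self.tgt_embed(tgt_ids)                      # teacher-forced targets
        dec_out = self.decoder(tgt, memory=fused)
        correct_logits = self.out_proj(dec_out)            # (B, T_tgt, vocab)
        return correct_logits, detect_logits


def multitask_loss(correct_logits, detect_logits, tgt_ids, err_labels, lambda_det=0.5):
    # Weighted sum of the correction and detection cross-entropy losses;
    # the weighting scheme is an assumption, not taken from the paper.
    ce = nn.CrossEntropyLoss()
    loss_corr = ce(correct_logits.transpose(1, 2), tgt_ids)
    loss_det = ce(detect_logits.transpose(1, 2), err_labels)
    return loss_corr + lambda_det * loss_det
```

In this reading, the detection head acts as an auxiliary supervision signal that encourages the fused representations to localize ASR errors, which in turn should help the decoder focus its corrections; how the paper actually weights or structures the two tasks is specified in the full text, not here.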