THE SJTU SYSTEM FOR MULTIMODAL INFORMATION BASED SPEECH PROCESSING CHALLENGE 2021

Wei Wang, Xun Gong, Yifei Wu, Zhikai Zhou, Chenda Li, Wangyou Zhang, Bing Han, Yanmin Qian

SPS

Length: 00:06:10

07 May 2022

This paper describes the SJTU system for the ICASSP Multi-modal Information based Speech Processing (MISP) Challenge 2021. To address speech recognition in real, complex environments, where time-synchronized near- and far-field signals are available for training an enhancement frontend, we build a joint system with a speech enhancement frontend and a speech recognition backend. The two modules are optimized jointly under both ASR and enhancement criteria. Audio-visual fusion is explored to further boost ASR performance. ROVER and test-time augmentation are used to combine recognition results from multiple systems. The final system achieves Chinese character error rates (CCER) of 34.9% on the dev set and 34.0% on the test set, placing third in the MISP challenge. Compared with the official baseline system, the absolute CCER reduction is 26.9% on the dev set and 28.7% on the test set.
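The CCER figures quoted above are character-level error rates. As a rough illustration only, a character error rate is commonly computed as the Levenshtein distance between reference and hypothesis characters, divided by the number of reference characters. The sketch below assumes that definition; the function names and the example sentences are hypothetical, and the challenge's official scoring tool and any text normalization it applies are not described here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(refs, hyps):
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars


# Hypothetical example: one substitution and one deletion against 6 reference characters.
refs = ["今天天气很好"]
hyps = ["今天天汽很"]
print(f"{character_error_rate(refs, hyps):.1f}")  # prints 33.3
```

Under this definition, an absolute reduction is simply the difference between two such percentages, which is how the 26.9% and 28.7% figures relative to the baseline should be read.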

Tags:

speech recognition

multi-modality

end-to-end