CHANNEL-WISE AV-FUSION ATTENTION FOR MULTI-CHANNEL AUDIO-VISUAL SPEECH RECOGNITION
Gaopeng Xu, Song Yang, Wei Li, Sang Wang, Wei Guo, Junfeng Yuan, Jie Gao
In this paper, we present our work on automatic speech recognition (ASR) for the Multimodal Information Based Speech Processing (MISP) Challenge 2021. We propose combining a guided source separation (GSS) based speech enhancement technique with an acoustic model built on a novel Channel-wise AV-fusion Encoder (CAE), and find that a suitable combination of these techniques provides substantial accuracy improvements. Our ASR system reduces the Chinese Character Error Rate (CCER) by 37.67% absolute compared to the baseline in track 2, achieving first place in the evaluation period with a CCER of 25.07%.
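To make the reported figures concrete, the following is a minimal sketch of how a character error rate is conventionally computed (Levenshtein edit distance divided by reference length) and of the arithmetic behind an "absolute" reduction. It is not the challenge's official scoring script. The function names `edit_distance` and `cer` are hypothetical, and the baseline value is derived here from the two numbers in the abstract rather than stated in it.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent: edits divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Figures taken from the abstract.
final_ccer = 25.07          # CCER of the submitted system, in percent
absolute_reduction = 37.67  # percentage points below the baseline

# Derived values (assumption: both figures refer to the same track 2 condition).
implied_baseline = final_ccer + absolute_reduction              # about 62.74
relative_reduction = 100.0 * absolute_reduction / implied_baseline  # roughly 60

print(f"implied baseline CCER: {implied_baseline:.2f}%")
print(f"relative reduction:    {relative_reduction:.2f}%")
```

Under that assumption, the 37.67-point absolute reduction corresponds to a relative reduction of roughly 60% from an implied baseline CCER of about 62.74%.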