CHANNEL-WISE AV-FUSION ATTENTION FOR MULTI-CHANNEL AUDIO-VISUAL SPEECH RECOGNITION

Gaopeng Xu, Song Yang, Wei Li, Sang Wang, Wei Guo, Junfeng Yuan, Jie Gao

Length: 00:05:35

07 May 2022

In this paper, we present our automatic speech recognition (ASR) system for the Multimodal Information Based Speech Processing (MISP) Challenge 2021. We combine a guided source separation (GSS)-based speech enhancement technique with a novel Channel-wise AV-fusion encoder (CAE)-based acoustic model, and find that a careful combination of these techniques provides substantial accuracy improvements. Our ASR system reduces the Chinese Character Error Rate (CCER) by 37.67% absolute compared to the baseline in track 2, achieving first place in the evaluation period with a CCER of 25.07%.
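
Since the abstract reports an absolute reduction, the two quoted figures imply a baseline CCER and a relative reduction that are not stated explicitly. The short sketch below works them out; the variable names are illustrative, and the baseline and relative values are derived here from the quoted numbers rather than taken from the paper.

```python
# Figures quoted in the abstract, in percent.
final_ccer = 25.07          # CCER of the submitted system
absolute_reduction = 37.67  # absolute reduction versus the track 2 baseline

# Derived values (assumption: "absolute" means a plain difference of the
# two error rates, so the baseline is the sum of the quoted figures).
baseline_ccer = final_ccer + absolute_reduction
relative_reduction = 100.0 * absolute_reduction / baseline_ccer

print(f"implied baseline CCER: {baseline_ccer:.2f}%")
print(f"implied relative reduction: {relative_reduction:.2f}%")
```

Under that reading, the baseline sits at roughly 62.7% CCER, so the reported system removes about three fifths of the baseline's character errors.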

Tags: channel-wise, audio-visual speech recognition, multimodal, guided source separation

Value-Added Bundle(s) Including this Product

22 May 2022

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
