
Robust Audio-Visual ASR with Unified Cross-modal Attention

Jiahong Li (Shanghai Jiao Tong University); Chenda Li (Shanghai Jiao Tong University); Yifei Wu (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University)

06 Jun 2023

Audio-visual speech recognition (AVSR) exploits noise-invariant visual information to improve the robustness of automatic speech recognition (ASR) systems. While previous work has mainly focused on clean conditions, we believe the visual modality is most valuable in noisy environments. The challenges arise from the difficulty of adaptively fusing audio-visual information and from possible interference within the training data. In this paper, we present a new audio-visual speech recognition model with a unified cross-modal attention mechanism. In particular, the auxiliary visual evidence is combined with the acoustic features along the temporal dimension in a unified space before the deep encoding network. This design provides flexible cross-modal context and requires no forced alignment, so the model can learn to leverage audio-visual information from the relevant frames. In experiments, the proposed model is shown to be robust to the potential absence of the visual modality and to misalignment between audio and visual frames. On the large-scale audio-visual dataset LRS3, our model further reduces the state-of-the-art WER on clean utterances and significantly improves performance under noisy conditions.
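
The fusion described in the abstract can be pictured as projecting both modalities into one shared space and concatenating them along the time axis before a joint encoder, so that self-attention can relate audio and visual frames without any forced alignment. The sketch below is illustrative only, not the authors' released implementation; the class name, layer sizes, and masking scheme are assumptions.

```python
import torch
import torch.nn as nn

class UnifiedCrossModalFusion(nn.Module):
    """Sketch: project audio and visual features into a unified space,
    concatenate them along time, and encode jointly so self-attention
    spans both modalities (hypothetical dimensions)."""

    def __init__(self, audio_dim=80, visual_dim=512, d_model=256,
                 n_heads=4, n_layers=6):
        super().__init__()
        # Modality-specific projections into the unified space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, audio_feats, visual_feats, visual_pad_mask=None):
        # audio_feats: (B, T_a, audio_dim); visual_feats: (B, T_v, visual_dim)
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        # Temporal concatenation: no frame-level alignment is required;
        # attention decides which cross-modal frames are relevant.
        x = torch.cat([a, v], dim=1)
        mask = None
        if visual_pad_mask is not None:
            # Mask out absent/padded visual frames so the model degrades
            # gracefully when the visual stream is partly missing.
            audio_mask = torch.zeros(a.shape[:2], dtype=torch.bool,
                                     device=a.device)
            mask = torch.cat([audio_mask, visual_pad_mask], dim=1)
        return self.encoder(x, src_key_padding_mask=mask)

# Example: 100 audio frames (e.g. log-mel) and 25 visual frames (e.g. lip
# embeddings) for a batch of 2 utterances -> (2, 125, 256) encoded sequence.
model = UnifiedCrossModalFusion()
out = model(torch.randn(2, 100, 80), torch.randn(2, 25, 512))
```

Masking the visual positions is one simple way to mimic the robustness to a missing visual stream that the abstract mentions; the paper's actual mechanism may differ.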
