MULTI-MODAL PRE-TRAINING FOR AUTOMATED SPEECH RECOGNITION

David Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:07:57

08 May 2022

Traditionally, research in automated speech recognition has focused on local-first encoding of audio representations to predict the spoken phonemes in an utterance. Unfortunately, approaches relying on such hyper-local information tend to be vulnerable to both local-level corruption (such as audio-frame drops, or loud noises) and global-level noise (such as environmental noise, or background noise) that has not been seen during training. In this work, we introduce a novel approach that leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs. We then use a new deep-fusion framework to integrate this global context into a traditional ASR method, and demonstrate that the resulting method can outperform baseline methods by up to 7% on Librispeech; gains on internal datasets range from 6% (on larger models) to 45% (on smaller models).

Tags:

conformer

multi-modal learning

automated speech recognition

video

bert

MULTI-MODAL PRE-TRAINING FOR AUTOMATED SPEECH RECOGNITION

David Chan, Shalini Ghosh, Debmalya Chakrabarty, Björn Hoffmeister

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

Keynote: Navigating the Transition to Sustainable Energy Solutions in a Power-Hungry World

Panel: Leveraging Technology to Achieve Carbon Neutrality of Buildings and Factories

Panel: Charting the Course for Future-Ready Data Centers in the Era of Sustainability

Join the IEEE Signal Processing Society