Looking Enhances Listening: Recovering Missing Speech Using Images

Tejas Srinivasan, Ramon Sanabria, Florian Metze

Length: 15:56

04 May 2020

Speech is understood better when visual context is available; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models use images only as a regularization signal while completely ignoring their semantic content. In this paper, we present a set of experiments that show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations. We observe that integrating visual context can yield up to a 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging visual context.
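
To make the evaluation idea in the abstract concrete, the following is a minimal sketch of how word masking and a masked word recovery score could be computed. It is not the authors' code: the function names, the frame-level alignment format, the choice to overwrite masked frames with a constant, and the definition of recovery as "the masked word appears in the hypothesis" are all assumptions made for illustration.

```python
def mask_words(frames, alignments, masked_words, fill=0.0):
    """Return a copy of `frames` with the frames of selected words overwritten.

    frames       : list of per-frame feature vectors (lists of floats)
    alignments   : list of (word, start_frame, end_frame), end exclusive
                   (assumed to come from a forced aligner; hypothetical format)
    masked_words : set of words whose audio should be masked
    fill         : constant written into masked frames (an assumption;
                   the masking signal used in the paper may differ)
    """
    masked = [list(f) for f in frames]  # copy so the input stays untouched
    for word, start, end in alignments:
        if word in masked_words:
            for t in range(start, min(end, len(masked))):
                masked[t] = [fill] * len(masked[t])
    return masked


def recovery_rate(references, hypotheses, masked_words):
    """Fraction of masked reference words that appear in the paired hypothesis."""
    total = 0
    recovered = 0
    for ref, hyp in zip(references, hypotheses):
        hyp_tokens = set(hyp.split())
        for token in ref.split():
            if token in masked_words:
                total += 1
                recovered += token in hyp_tokens
    return recovered / total if total else 0.0


def relative_improvement(baseline, multimodal):
    """Relative gain of the multimodal score over the unimodal baseline."""
    return (multimodal - baseline) / baseline
```

Under these assumptions, a "relative improvement" compares the recovery rate of a multimodal model against that of an audio-only baseline on the same masked inputs. For example, hypothetical rates of 0.40 and 0.54 (toy values, not results from the paper) would correspond to a 35% relative improvement.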

Tags:

sps conference

icassp 2020 virtual conference

May 2020

icassp 2020
