Multi-Temporal Lip-Audio Memory for Visual Speech Recognition

Jeong Hun Yeo (Korea Advanced Institute of Science and Technology); Minsu Kim (KAIST); Yong Man Ro (KAIST)

06 Jun 2023

Visual Speech Recognition (VSR) is the task of predicting a sentence or word from lip movements. Several recent works use audio signals to supplement visual information. However, existing methods exploit only limited information, such as phoneme-level features and soft labels from Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement the insufficient information in lip movements. The proposed method has two main parts: 1) The MTLAM saves multi-temporal audio features produced from short- and long-term audio signals, and memorizes a visual-to-audio mapping so that the stored multi-temporal audio features can be loaded from visual features at the inference phase. 2) We design an audio temporal model to produce multi-temporal audio features that capture the context of neighboring words. In addition, to construct an effective visual-to-audio mapping, the audio temporal models generate audio features that are time-aligned with the visual features. Through extensive experiments, we validate the effectiveness of the MTLAM, which achieves state-of-the-art performance on two public VSR datasets.
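
The idea of loading stored audio features from visual features can be illustrated with a small key-value memory. The sketch below is not the authors' implementation: the abstract does not specify the addressing scheme, so softmax addressing over dot-product scores is an assumption, and every name in it (`recall_audio`, `keys`, `values_by_scale`, the `"short"` and `"long"` scales, the toy dimensions) is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recall_audio(visual_query, keys, values_by_scale):
    """Address the memory with a visual feature and read audio features.

    Assumption: slot i of every temporal scale shares the same key, so one
    set of addressing weights reads all scales. This is an illustrative
    choice, not something stated in the abstract.
    """
    weights = softmax([dot(visual_query, key) for key in keys])
    recalled = {}
    for scale, slots in values_by_scale.items():
        dim = len(slots[0])
        # Weighted sum of the stored audio features for this temporal scale.
        recalled[scale] = [
            sum(w * slot[d] for w, slot in zip(weights, slots))
            for d in range(dim)
        ]
    return weights, recalled


# Toy memory with two slots (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0]]
values_by_scale = {
    "short": [[1.0, 0.0], [0.0, 1.0]],  # stand-in for short-term audio features
    "long": [[2.0], [4.0]],             # stand-in for long-term audio features
}

# A visual feature that closely matches the first key.
visual_query = [5.0, 0.0]
weights, recalled = recall_audio(visual_query, keys, values_by_scale)
print(weights)
print(recalled)
```

In this toy setup the query matches the first key, so the recalled features at both scales are dominated by the first slot. At inference time, only the visual feature is needed to read the memory, which mirrors the role the abstract assigns to the visual-to-audio mapping.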

Tags:

Machine/deep learning methodologies for multimedia

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
