
LEARNING TO PREDICT SPEECH IN SILENT VIDEOS VIA AUDIOVISUAL ANALOGY

Ravindra Yadav, Vinay Namboodiri, Rajesh Hegde, Ashish Sardana

Length: 00:07:05
12 May 2022

Lipreading is a difficult task, even for humans, and synthesizing the original speech waveform from lip movements makes the problem even more challenging. Towards this end, we present a deep learning framework that can be trained end-to-end to learn the mapping between the auditory and visual signals. In particular, our interest in this paper is to design a model that can efficiently predict the speech signal in a given silent talking-face video. The proposed framework generates a speech signal by mapping the video frames into a sequence of feature vectors. However, unlike some recent methods that adopt a sequence-to-sequence approach for translation from the frame stream to the audio stream, we cast it as an analogy learning problem between the two modalities, in which each frame is mapped to the corresponding speech segment via a deep audio-visual analogy framework. We predict a plausible audio stream by training adversarially against a discriminator network. Our experiments, both qualitative and quantitative, on the publicly available GRID dataset show that the proposed method outperforms prior work on existing evaluation benchmarks. Our user studies confirm that the generated samples are more natural and closely match the ground-truth speech signal.
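The sketch below illustrates the general shape of such a pipeline: a per-frame visual encoder, a mapping from visual features to speech-segment features, and a discriminator used for adversarial training. All module names, feature sizes, and losses here are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of frame-to-speech mapping with adversarial training.
# Architectures, dimensions, and losses are assumed for illustration only.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Encodes one video frame (assumed 3x64x64) into a visual feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, frame):
        return self.net(frame)

class AnalogyMapper(nn.Module):
    """Maps each visual feature to a corresponding speech-segment feature."""
    def __init__(self, feat_dim=256, audio_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, audio_dim),
        )

    def forward(self, visual_feat):
        return self.net(visual_feat)

class SegmentDiscriminator(nn.Module):
    """Scores whether a speech-segment feature looks real or generated."""
    def __init__(self, audio_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
        )

    def forward(self, segment):
        return self.net(segment)

# One adversarial training step on a batch of (frame, speech-segment) pairs.
encoder, mapper, disc = FrameEncoder(), AnalogyMapper(), SegmentDiscriminator()
g_opt = torch.optim.Adam(list(encoder.parameters()) + list(mapper.parameters()), lr=2e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

frames = torch.randn(8, 3, 64, 64)   # silent talking-face frames (dummy data)
real_segments = torch.randn(8, 80)   # aligned ground-truth speech features (dummy data)

# Discriminator update: distinguish real segments from generated ones.
fake_segments = mapper(encoder(frames)).detach()
d_loss = bce(disc(real_segments), torch.ones(8, 1)) + \
         bce(disc(fake_segments), torch.zeros(8, 1))
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# Generator update: fool the discriminator while matching the ground truth.
fake_segments = mapper(encoder(frames))
g_loss = bce(disc(fake_segments), torch.ones(8, 1)) + \
         nn.functional.l1_loss(fake_segments, real_segments)
g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

The per-frame mapping reflects the analogy-learning framing described in the abstract, where each video frame is paired with its own speech segment rather than translating the whole frame stream to the audio stream at once; the adversarial term encourages the predicted segments to sound plausible.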
