Audio-Visual Inpainting: Reconstructing Missing Visual Information with Sound

Valentina Sanguineti (Istituto Italiano di Tecnologia); Sanket Thakur (Istituto Italiano di Tecnologia); Pietro Morerio (Istituto Italiano di Tecnologia); Alessio Del Bue (Istituto Italiano di Tecnologia (IIT)); Vittorio Murino (Istituto Italiano di Tecnologia)

07 Jun 2023

We tackle audio-visual inpainting, the problem of completing an image so that it is consistent with the sound associated with the scene. To this end, we propose a multimodal audio-visual inpainting method (AVIN) and show how to leverage sound to reconstruct semantically consistent images. AVIN is a two-stage algorithm: it first learns the scene semantics and reconstructs low-resolution images based on a probability distribution of pixels in the space conditioned on audio, and then refines this result with a GAN-based network to increase the resolution of the reconstructed image. We show that AVIN is able to recover the original content, especially in the hard cases where the missing area heavily degrades the scene semantics. It can also perform cross-modal generation when no visual context is observed at all, reconstructing visual data from sound only. Code will be made available upon acceptance.
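The two-stage flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `inpaint_two_stage`, `toy_stage1`, and `toy_stage2` are hypothetical, the two toy stages are trivial stand-ins for the learned audio-conditioned model and the GAN-based refinement network, and the final compositing step (keeping observed pixels unchanged) is an assumption that the abstract does not state.

```python
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel values
Mask = List[List[int]]     # 1 = pixel is missing, 0 = pixel is observed


def downsample(img, factor: int):
    """Average-pool by `factor` (assumes both dimensions are divisible by it)."""
    out = []
    for r in range(0, len(img), factor):
        row = []
        for c in range(0, len(img[0]), factor):
            block = [img[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def toy_stage1(low_img: Image, low_mask: Mask, audio_feat: List[float]) -> Image:
    """Stand-in for stage 1: fill missing low-res pixels from the audio feature.

    A real model would sample pixels from a learned distribution conditioned
    on audio; here we simply use the mean of the audio vector (assumption).
    """
    fill = sum(audio_feat) / len(audio_feat)
    return [[fill if m else p for p, m in zip(prow, mrow)]
            for prow, mrow in zip(low_img, low_mask)]


def toy_stage2(coarse: Image, factor: int) -> Image:
    """Stand-in for stage 2: nearest-neighbour upsampling instead of a GAN."""
    return [[coarse[r // factor][c // factor]
             for c in range(len(coarse[0]) * factor)]
            for r in range(len(coarse) * factor)]


def inpaint_two_stage(image: Image, mask: Mask, audio_feat: List[float],
                      stage1: Callable = toy_stage1,
                      stage2: Callable = toy_stage2,
                      factor: int = 2) -> Image:
    """Coarse audio-conditioned completion, then resolution refinement."""
    low_img = downsample(image, factor)
    # A low-res cell counts as missing if any of its pixels is missing.
    low_mask = [[1 if v > 0 else 0 for v in row] for row in downsample(mask, factor)]
    coarse = stage1(low_img, low_mask, audio_feat)
    refined = stage2(coarse, factor)
    # Assumed compositing: observed pixels are kept, missing ones are generated.
    return [[g if m else o for o, g, m in zip(orow, grow, mrow)]
            for orow, grow, mrow in zip(image, refined, mask)]
```

When the mask covers the whole image, the output of this sketch depends only on the audio feature, which mirrors the cross-modal generation case mentioned in the abstract.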

Tags:

Self-supervised and semi-supervised learning

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
