IEEE Members: $11.00
Non-members: $15.00
Length: 01:01:49
Perception systems that can both see and hear have great potential to unlock real-world video understanding. When the two modalities work together, they can improve data efficiency for machine learning algorithms by connecting the dots between the interacting signals. I will present our recent work exploring audio-visual video analysis in terms of both semantic and spatial perception. First, we consider visually-guided audio source separation: given video with multiple sounding objects, which sounds come from which visual objects? The proposed methods can focus on a human speaker's voice amidst busy ambient sounds, split the sounds of multiple instruments playing simultaneously, or simply provide a semantic prior for the category of a visible object. Then, turning to activity recognition, we leverage audio as a fast "preview" of an entire video clip in order to concentrate expensive visual feature computation where it is most needed. Finally, moving from those semantic tasks to spatial audio understanding, we introduce approaches for self-supervised feature learning that leverage sounds heard during training to embed geometric cues into visual encoders. The resulting representations benefit spatially grounded tasks like depth estimation, immersive 3D sound generation for video, and even audio source separation.
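To illustrate the "audio as preview" idea in pseudocode form, here is a minimal sketch under my own assumptions (it is not the method presented in the talk): a lightweight audio scorer ranks the segments of a clip, and only the top-scoring segments are passed to an expensive visual model. The names cheap_audio_score and expensive_visual_features are hypothetical placeholders for a small audio network and a heavy visual backbone.

    import numpy as np

    def cheap_audio_score(audio_segment: np.ndarray) -> float:
        # Hypothetical stand-in for a small audio network;
        # here, short-term energy serves as a toy relevance score.
        return float(np.mean(audio_segment ** 2))

    def expensive_visual_features(frame: np.ndarray) -> np.ndarray:
        # Hypothetical stand-in for a costly visual backbone (e.g., a deep CNN).
        return frame.mean(axis=(0, 1))  # toy per-frame feature

    def audio_previewed_recognition(frames, audio_segments, k=3):
        # Score every segment with the cheap audio model, then run the
        # expensive visual model only on the top-k segments.
        scores = np.array([cheap_audio_score(a) for a in audio_segments])
        top_idx = np.argsort(scores)[::-1][:k]
        feats = [expensive_visual_features(frames[i]) for i in top_idx]
        return np.mean(feats, axis=0)  # pooled clip-level feature

    # Toy usage: 10 segments, each with one frame and one audio window.
    rng = np.random.default_rng(0)
    frames = [rng.random((224, 224, 3)) for _ in range(10)]
    audio = [rng.random(16000) for _ in range(10)]
    clip_feature = audio_previewed_recognition(frames, audio, k=3)
    print(clip_feature.shape)

The point of the sketch is only the control flow: cheap audio computation over the whole clip decides where the expensive visual computation is spent.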