Cross modal video representations for weakly supervised active speaker localization

Rahul Sharma (University of Southern California); Krishna Somandepalli (University of Southern California); Shrikanth Narayanan (University of Southern California)

09 Jun 2023

An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is talking, and when, how, and where. Speaker activity can be inferred from the rich multimodal information present in media content; this is nonetheless a challenging problem owing to the vast variety and contextual variability of such content and the scarcity of labeled data. In this work, we present a cross-modal neural network for learning visual representations that carry implicit information about the spatial location of a speaker in the visual frames. Since manual annotations of active speakers in visual frames are expensive to acquire, we present a weakly supervised system for localizing active speakers in movie content: we use the learned cross-modal visual representations and derive weak supervision from movie subtitles, which act as a proxy for voice activity, thus requiring no manual annotations. Furthermore, we propose an audio-assisted post-processing formulation for active speaker detection. We evaluate the proposed system on three benchmark datasets: i) the AVA active speaker dataset, ii) the Visual Person Clustering dataset, and iii) the Columbia dataset, and demonstrate the effectiveness of the cross-modal embeddings for localizing active speakers in comparison to fully supervised systems.
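To make the weak-supervision idea concrete, the sketch below shows one plausible way to train a visual encoder so that a spatial activation map is supervised only by clip-level voice-activity labels (e.g., derived from subtitle timings), with no bounding-box annotations. This is a minimal illustrative sketch, not the authors' implementation: the module names, layer sizes, and the `weak_vad` labels are all assumptions introduced here for clarity.

```python
# Illustrative sketch of weakly supervised active speaker localization.
# A visual encoder produces a per-location "speaking" score map; its global
# pool is trained against subtitle-derived clip-level voice activity labels.
# All names and shapes below are hypothetical, not taken from the paper.
import torch
import torch.nn as nn

class WeaklySupervisedSpeakerLocalizer(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Visual backbone: any spatio-temporal CNN yielding features of
        # shape (B, C, T, H, W); a small 3D conv stack stands in here.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        # 1x1x1 conv maps features to a per-location speaking score, so the
        # activation map doubles as a localization heatmap at inference time.
        self.score = nn.Conv3d(feat_dim, 1, kernel_size=1)

    def forward(self, clips):
        # clips: (B, 3, T, H, W) RGB video clips
        fmap = self.backbone(clips)                # (B, C, T, H, W)
        heatmap = self.score(fmap)                 # (B, 1, T, H, W)
        # Global average over space and time -> clip-level voice activity logit.
        clip_logit = heatmap.mean(dim=(2, 3, 4))   # (B, 1)
        return clip_logit, heatmap

model = WeaklySupervisedSpeakerLocalizer()
criterion = nn.BCEWithLogitsLoss()

clips = torch.randn(4, 3, 8, 64, 64)                # toy batch of video clips
weak_vad = torch.tensor([[1.], [0.], [1.], [0.]])   # subtitle-derived labels

logits, heatmaps = model(clips)
loss = criterion(logits, weak_vad)                  # weak, clip-level supervision
loss.backward()
```

Because the clip-level prediction is obtained by pooling the spatial score map, the only way for the network to lower the loss is to place high activations where visual evidence of speech appears; at test time the unpooled heatmap can then be read out as an active speaker localization, which is the general intuition behind this class of weakly supervised approaches.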