ASD-TRANSFORMER: EFFICIENT ACTIVE SPEAKER DETECTION USING SELF AND MULTIMODAL TRANSFORMERS

Gourav Datta, Tyler Etchart, Vivek Yadav, Varsha Hedau, Pradeep Natarajan, Shih-Fu Chang

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 00:08:18

13 May 2022

Multimodal active speaker detection (ASD) methods assign a speaking/not-speaking label per individual in a video clip. ASD is critical for applications such as natural human-computer interaction, speaker diarization, and video re-framing. Recent work has shown the success of transformers in multimodal settings, thus we propose a novel framework that leverages modern transformer and concatenation mechanisms to efficiently capture the interaction between audio and video modalities for ASD. We achieve mAP similar to state-of-the-art (93.0% vs 93.5%) on the AVA-ActiveSpeaker dataset. Further, our model has ~3x smaller size (15.23MB vs 49.82MB), reduced FLOPs count (11.8 vs 14.3), and lower training time (15h vs 38h). To verify our model is making predictions from the right visual cues, we computed saliency maps over input images. We found that in addition to mouth regions, the nose, cheek, and area under the eye were helpful in identifying active speakers. Our ablation study reveals that the mouth region alone achieved lower mAP (91.9% vs 93.0%) compared to full face region, supporting our hypothesis that facial expressions in addition to mouth region are useful for ASD.

Tags:

active speaker detection

human-computer interaction

multimodal

saliency maps

transformer