Audio-Visual Tracking of Multiple Speakers via a PMBM Filter

Jinzheng Zhao, Peipei Wu, Xubo Liu, Wenwu Wang, Yong Xu, Lyudmila Mihaylova, Simon Godsill

Length: 00:10:20
11 May 2022

Audio-visual tracking of multiple speakers requires estimating the state (e.g., location and velocity) of each speaker by leveraging information from both the audio and visual modalities. Jointly estimating the number of speakers and their states remains a challenging problem. We propose an Audio-Visual Poisson Multi-Bernoulli Mixture filter (AV-PMBM) that can not only predict the number of speakers but also accurately estimate their states. We also propose a novel sound source localization technique that combines direction-of-arrival (DOA) information with a deep-learning-based object detector to provide reliable audio measurements for the AV tracker. To our knowledge, this is the first attempt to apply the PMBM filter to multi-speaker tracking with audio-visual modalities. Experiments on the AV16.3 dataset demonstrate that AV-PMBM achieves state-of-the-art performance in terms of the optimal sub-pattern assignment (OSPA) metric.
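The OSPA metric used for evaluation compares an estimated set of speaker positions against the ground-truth set, penalizing both localization error (truncated at a cutoff c) and cardinality mismatch. The following is a minimal illustrative sketch of the standard OSPA distance, not the authors' evaluation code; the brute-force assignment is adequate for the small speaker counts typical of this task.

```python
import math
from itertools import permutations

def ospa(X, Y, c=1.0, p=2):
    """OSPA distance of order p with cutoff c between two point sets.

    X, Y are sequences of coordinate tuples (e.g., 2-D image positions).
    Uses brute-force optimal assignment, which is fine for a handful of
    speakers; large sets would need a Hungarian-algorithm solver instead.
    """
    m, n = len(X), len(Y)
    if m == 0 and n == 0:
        return 0.0
    if m == 0 or n == 0:
        return float(c)            # all mass is cardinality penalty
    if m > n:                      # ensure the smaller set is X
        X, Y, m, n = Y, X, n, m

    def dist(a, b):
        # cutoff-truncated Euclidean distance
        return min(c, math.dist(a, b))

    # optimal assignment of the m estimates into the n truths
    best = min(
        sum(dist(x, Y[j]) ** p for x, j in zip(X, perm))
        for perm in permutations(range(n), m)
    )
    # unassigned truths each incur the maximum per-point cost c
    cost = best + (c ** p) * (n - m)
    return (cost / n) ** (1.0 / p)
```

For example, a perfect estimate yields distance 0, while reporting one speaker when two are present contributes a cardinality penalty of c^p for the missed target before normalization.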