Audio-visual speaker diarization in the framework of multi-user human-robot interaction
Timothée Dhaussy (Université Avignon); Bassam Jabaian (LIA - Avignon university); Fabrice Lefevre (Univ. Avignon); Radu Horaud (Inria)
SPS
The speaker diarization task answers the question "who is speaking at a given time?", which provides valuable information for scene analysis in domains such as robotics. In this paper, we introduce a temporal audio-visual fusion model for multi-user speaker diarization with low computational requirements, good robustness, and no need for training. The diarization problem is represented as a latent-variable temporal graphical model that assigns audio sources to individuals. The approach is carefully tested and benchmarked on two different datasets, and our experimental study shows that the proposed solution yields promising results.
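To illustrate the general idea of a latent-variable temporal model that assigns audio sources to individuals, the sketch below runs a toy forward (filtering) pass: the latent variable is the index of the active speaker, and the audio observation (here a direction of arrival, DOA) is scored against each visually tracked person's position. All function names, the Gaussian observation model, and the parameter values are illustrative assumptions, not the paper's actual formulation.

```python
import math

def av_diarization(audio_doa, person_doas, stay_prob=0.8, obs_sigma=10.0):
    """Toy forward pass of a latent-variable temporal model.

    audio_doa   : list of audio DOA observations (degrees), one per frame.
    person_doas : list of per-frame lists with each tracked person's
                  visual DOA (degrees).
    Returns the MAP speaker index at each frame.
    (Hypothetical sketch; not the authors' model.)
    """
    n = len(person_doas[0])
    belief = [1.0 / n] * n          # uniform prior over speakers
    assignments = []
    for t, doa in enumerate(audio_doa):
        # Temporal transition: the current speaker keeps the floor
        # with probability stay_prob, otherwise switches uniformly.
        switch = (1.0 - stay_prob) / (n - 1) if n > 1 else 0.0
        pred = [sum(belief[j] * (stay_prob if j == i else switch)
                    for j in range(n)) for i in range(n)]
        # Gaussian likelihood of the audio DOA given each person's position.
        lik = [math.exp(-0.5 * ((doa - person_doas[t][i]) / obs_sigma) ** 2)
               for i in range(n)]
        post = [pred[i] * lik[i] for i in range(n)]
        z = sum(post) or 1.0        # guard against all-zero likelihoods
        belief = [p / z for p in post]
        assignments.append(max(range(n), key=lambda i: belief[i]))
    return assignments
```

For example, with two persons fixed at 0° and 90°, audio observations near 0° are assigned to the first person and observations near 90° to the second, with the transition prior smoothing out isolated noisy frames.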