The WHU-Alibaba Audio-Visual Speaker Diarization System for the MISP 2022 Challenge
Ming Cheng (Duke Kunshan University); Haoxu Wang (Wuhan University); Ziteng Wang (Alibaba Group); Qiang Fu (Alibaba Group); Ming Li (Duke Kunshan University)
This paper describes the system developed by the WHU-Alibaba team for the Multimodal Information Based Speech Processing (MISP) 2022 Challenge. We extend the Sequence-to-Sequence Target-Speaker Voice Activity Detection framework to simultaneously detect multiple speakers’ voice activities from audio-visual signals. The final system achieves a diarization error rate (DER) of 8.82% on the evaluation set of the competition database, ranking first in the speaker diarization track of the MISP 2022 Challenge, an ICASSP Signal Processing Grand Challenge.
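For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. The following is a minimal frame-level sketch of that definition, not the challenge's official scoring tool. It assumes that reference and hypothesis speaker columns are already aligned (no optimal speaker mapping is performed) and that no forgiveness collar is applied; the function name `frame_level_der` and the toy data are hypothetical.

```python
def frame_level_der(ref, hyp):
    """Frame-level diarization error rate (illustrative sketch).

    ref, hyp: lists of frames; each frame is a list of 0/1 flags,
    one per speaker. Assumes speaker columns are already aligned
    between ref and hyp, and that no collar is applied.
    """
    miss = false_alarm = confusion = total = 0
    for r, h in zip(ref, hyp):
        n_ref = sum(r)  # speakers active in the reference frame
        n_hyp = sum(h)  # speakers active in the hypothesis frame
        n_correct = sum(1 for a, b in zip(r, h) if a and b)
        miss += max(0, n_ref - n_hyp)         # reference speech not detected
        false_alarm += max(0, n_hyp - n_ref)  # detected speech not in reference
        confusion += min(n_ref, n_hyp) - n_correct  # attributed to wrong speaker
        total += n_ref
    if total == 0:
        raise ValueError("reference contains no speech")
    return (miss + false_alarm + confusion) / total


# Toy example: 5 frames, 2 speakers (made-up data for illustration only).
ref = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 0]]
hyp = [[1, 0], [0, 1], [1, 0], [0, 1], [0, 1]]
der = frame_level_der(ref, hyp)
print(f"DER = {der:.2%}")
```

In this toy example there is one confused frame, one missed speaker in an overlapped frame, and one false-alarm frame, against five reference speaker-frames in total. Real evaluations typically operate on time-stamped segments and search for the best speaker mapping before scoring.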