Multi-Speaker End-to-end Multi-modal Speaker Diarization System for the MISP 2022 CHALLENGE
Tao Liu (Shanghai Jiao Tong University); Zhengyang Chen (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University); Kai Yu (Shanghai Jiao Tong University)
This paper presents the design and implementation of our system for Track 1 of the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. We design an end-to-end transformer-based multi-talker system. The transformer backbone is well suited to capturing long-term features, which is crucial for multi-modal speaker diarization when modalities are temporarily missing. In addition, we employ several loss functions and image data augmentation techniques to prevent over-fitting during training. To further improve the system's performance, we incorporate Inter-channel Phase Difference (IPD) features to model speaker location information and pre-train an ECAPA-TDNN-based model to extract speaker embeddings. Our system achieved a diarization error rate (DER) of 10.82% on the evaluation set, earning second place in the audio-visual speaker diarization task of the MISP 2022 Challenge.
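As an illustration of the IPD features mentioned above, the sketch below shows one common way to compute them from the STFTs of a microphone pair; the function name, array shapes, and the cos/sin encoding are assumptions for this example, not a description of the authors' exact implementation.

```python
import numpy as np

def compute_ipd(stft_ref: np.ndarray, stft_other: np.ndarray) -> np.ndarray:
    """Inter-channel Phase Difference (IPD) between two channels.

    stft_ref, stft_other: complex STFTs of shape (frames, freq_bins)
    for a reference microphone and another microphone in the array.
    Returns cos/sin-encoded IPD of shape (frames, 2 * freq_bins).
    """
    # Phase difference per time-frequency bin
    phase_diff = np.angle(stft_ref) - np.angle(stft_other)
    # Encode as cos/sin to avoid the 2*pi wrap-around discontinuity
    return np.concatenate([np.cos(phase_diff), np.sin(phase_diff)], axis=-1)
```

Such IPD features are typically computed for several microphone pairs and concatenated with spectral features before being fed to the diarization network.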