Multi-Speaker End-to-end Multi-modal Speaker Diarization System for the MISP 2022 CHALLENGE
Tao Liu (Shanghai Jiao Tong University); Zhengyang Chen (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University); Kai Yu (Shanghai Jiao Tong University)
This paper presents the design and implementation of our system for Track 1 of the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. We design an end-to-end transformer-based multi-talker system. The transformer backbone is well suited to capturing long-term features, which is crucial for multi-modal speaker diarization when modalities are temporarily missing. In addition, we employ several loss functions and image data augmentation techniques to prevent over-fitting during training. To further improve the system's performance, we incorporate Inter-channel Phase Difference (IPD) features to model speaker location information and pre-train an ECAPA-TDNN-based model to extract speaker embeddings. Our system achieved a diarization error rate (DER) of 10.82% on the evaluation set, earning second place in the audio-visual speaker diarization task of the MISP 2022 Challenge.
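As an illustration of the IPD features mentioned above, the sketch below shows one common way to compute them from the STFTs of a microphone pair; the function name, array shapes, and the cos/sin encoding are assumptions for this example, not a description of the authors' exact implementation.

```python
import numpy as np

def compute_ipd(stft_ref: np.ndarray, stft_other: np.ndarray) -> np.ndarray:
    """Inter-channel Phase Difference (IPD) between two channels.

    stft_ref, stft_other: complex STFTs of shape (frames, freq_bins)
    for a reference microphone and another microphone in the array.
    Returns cos/sin-encoded IPD of shape (frames, 2 * freq_bins).
    """
    # Phase difference per time-frequency bin
    phase_diff = np.angle(stft_ref) - np.angle(stft_other)
    # Encode as cos/sin to avoid the 2*pi wrap-around discontinuity
    return np.concatenate([np.cos(phase_diff), np.sin(phase_diff)], axis=-1)
```

Such IPD features are typically computed for several microphone pairs and concatenated with spectral features before being fed to the diarization network.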