THE USTC-XIMALAYA SYSTEM FOR THE ICASSP 2022 MULTI-CHANNEL MULTI-PARTY MEETING TRANSCRIPTION (M2MET) CHALLENGE
Maokui He, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Jun Du, Xiang Lv, Weilin Zhou, Jingjing Yin, Yuhang Cao, Heng Lu, Chin-Hui Lee
Length: 00:11:07
Target-speaker voice activity detection (TS-VAD) has shown promising results for speaker diarization of multi-speaker conversations in real scenarios. In this paper, we extend that earlier work to a Mandarin meeting scenario with heavy reverberation and noise and a high overlap ratio. First, carefully prepared data containing both real meetings and simulated indoor conversations are used to train the TS-VAD model. Then, we apply a powerful post-processing strategy: thresholding the model outputs, merging two segments separated by a short silence interval, deleting detected segments that the golden speech labels mark as silence, and assigning segments detected as silent but marked as speech by the golden speech labels to the nearby speaker who talks the longest. Finally, fusing the multi-channel results with DOVER-Lap brings a further improvement. Through experiments on the AliMeeting corpus, a newly released Mandarin meeting dataset, we demonstrate that our method reduces the diarization error rate (DER) by up to 66.55% and 60.59% relative on the Eval and Test sets, respectively, compared with classical clustering-based diarization.
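The first two post-processing steps named in the abstract (thresholding and merging segments separated by a short silence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.5 threshold, the frame shift, and the maximum gap are all assumptions chosen for illustration, and the steps that rely on golden speech labels are omitted.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def frames_to_segments(probs: Sequence[float],
                       threshold: float = 0.5,    # assumed value
                       frame_shift: float = 0.01  # assumed frame shift in seconds
                       ) -> List[Segment]:
    """Threshold per-frame speech probabilities for one speaker into segments."""
    segments: List[Segment] = []
    start = None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i  # a speech segment begins at this frame
        elif not active and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:  # close a segment that runs to the last frame
        segments.append((start * frame_shift, len(probs) * frame_shift))
    return segments


def merge_short_gaps(segments: Sequence[Segment],
                     max_gap: float = 0.3  # assumed maximum silence to bridge
                     ) -> List[Segment]:
    """Merge consecutive segments whose silence gap is at most max_gap."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

In this sketch, `frames_to_segments` would be applied to each target speaker's output separately, and `merge_short_gaps` then smooths the result before any channel fusion.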