Self-Supervised Learning For Audio-Visual Speaker Diarization
Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang
Speaker diarization, which aims to find the speech segments of specific speakers, has been widely used in human-centered applications such as video conferences and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that solves the speaker diarization problem without massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We also introduce a new way to measure unsynchronization distance, which enables real-time inference, in contrast to the long latency of previous methods. A new large-scale audio-video corpus was designed to fill the lack of audio-video datasets in Chinese. The experimental results show that our proposed method yields a remarkable gain of +8% on a real-world human-computer interaction system.
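To make the triplet-based synchronization idea concrete, the following is a minimal sketch in PyTorch, assuming paired audio and lip-region embeddings and a simple one-step temporal shift to construct unsynchronized (negative) pairs. The function name sync_triplet_loss, the margin value, and the shift-based negative sampling are illustrative assumptions for this sketch, not the paper's exact dynamic triplet loss.

import torch
import torch.nn.functional as F

def sync_triplet_loss(audio_emb, video_emb, margin=0.5):
    # Synchronized (positive) pairs: audio and video embeddings from the
    # same time step should be close in the shared embedding space.
    pos_dist = F.pairwise_distance(audio_emb, video_emb)
    # Unsynchronized (negative) pairs: here simply the video embeddings
    # shifted by one time step, a placeholder negative-sampling scheme.
    neg_dist = F.pairwise_distance(audio_emb,
                                   torch.roll(video_emb, shifts=1, dims=0))
    # Hinge on the distance gap: synchronized pairs must end up closer
    # than unsynchronized ones by at least `margin`.
    return F.relu(pos_dist - neg_dist + margin).mean()

# Toy usage: a batch of 32 paired audio / lip-region embeddings of size 128.
audio_emb = torch.randn(32, 128)
video_emb = torch.randn(32, 128)
loss = sync_triplet_loss(audio_emb, video_emb)

In a diarization setting, the distance between an audio embedding and each candidate speaker's visual embedding can then serve as a synchronization score, with the closest (most synchronized) face assigned as the active speaker for that segment.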