
Self-Supervised Learning For Audio-Visual Speaker Diarization

Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang

Length: 14:18
04 May 2020

Speaker diarization, the task of finding the speech segments of specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that solves the speaker diarization problem without a massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We also introduce a new way to measure the unsynchronization distance, which enables real-time inference in contrast to the long latency of previous methods. A new large-scale audio-video corpus is designed to fill the vacancy of audio-video datasets in Chinese. Experimental results show that our proposed method yields a remarkable gain of +8% on a real-world human-computer interaction system.
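To make the idea concrete, the following is a minimal sketch of a triplet-style audio-visual synchronization loss of the kind the abstract refers to. It is not the authors' implementation: the embedding networks, the margin value, and the specific "dynamic" margin/mining strategy of the dynamic triplet loss are assumptions made here for illustration only.

```python
# Illustrative sketch only: a generic audio-visual synchronization triplet loss.
# The paper's dynamic triplet loss and multinomial loss are not reproduced here.
import torch
import torch.nn.functional as F

def av_sync_triplet_loss(visual_emb, audio_sync_emb, audio_shift_emb, margin=0.5):
    """Pull time-aligned audio toward the visual anchor, push shifted audio away.

    visual_emb:      (B, D) embeddings of face/lip crops (anchor)
    audio_sync_emb:  (B, D) embeddings of the synchronized audio (positive)
    audio_shift_emb: (B, D) embeddings of temporally shifted audio (negative)
    """
    d_pos = F.pairwise_distance(visual_emb, audio_sync_emb)   # small when in sync
    d_neg = F.pairwise_distance(visual_emb, audio_shift_emb)  # large when out of sync
    return F.relu(d_pos - d_neg + margin).mean()
```

At inference time, the learned audio-visual distance itself can serve as an (un)synchronization score, so speech segments can be assigned to on-screen speakers frame by frame rather than after accumulating a long context, which is what makes low-latency diarization possible.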
