
Self-Supervised Learning For Audio-Visual Speaker Diarization

Yifan Ding, Yong Xu, Shi-Xiong Zhang, Yahuan Cong, Liqiang Wang

Length: 14:18
04 May 2020

Speaker diarization, the task of finding the speech segments of specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that solves the speaker diarization problem without a massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We also introduce a new way to measure the unsynchronization distance, which enables real-time inference in contrast to the long latency of previous methods. A new large-scale audio-video corpus is designed to fill the vacancy of audio-video datasets in Chinese. Experimental results show that our proposed method yields a remarkable gain of +8% on a real-world human-computer interaction system.
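To make the idea concrete, the following is a minimal sketch of a triplet-style audio-visual synchronization loss of the kind the abstract refers to. It is not the authors' implementation: the embedding networks, the margin value, and the specific "dynamic" margin/mining strategy of the dynamic triplet loss are assumptions made here for illustration only.

```python
# Illustrative sketch only: a generic audio-visual synchronization triplet loss.
# The paper's dynamic triplet loss and multinomial loss are not reproduced here.
import torch
import torch.nn.functional as F

def av_sync_triplet_loss(visual_emb, audio_sync_emb, audio_shift_emb, margin=0.5):
    """Pull time-aligned audio toward the visual anchor, push shifted audio away.

    visual_emb:      (B, D) embeddings of face/lip crops (anchor)
    audio_sync_emb:  (B, D) embeddings of the synchronized audio (positive)
    audio_shift_emb: (B, D) embeddings of temporally shifted audio (negative)
    """
    d_pos = F.pairwise_distance(visual_emb, audio_sync_emb)   # small when in sync
    d_neg = F.pairwise_distance(visual_emb, audio_shift_emb)  # large when out of sync
    return F.relu(d_pos - d_neg + margin).mean()
```

At inference time, the learned audio-visual distance itself can serve as an (un)synchronization score, so speech segments can be assigned to on-screen speakers frame by frame rather than after accumulating a long context, which is what makes low-latency diarization possible.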
