  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 14:59
04 May 2020

Deep neural network based audio embeddings (d-vectors) have demonstrated superior performance in audio-only speaker diarization compared to traditional acoustic features such as mel-frequency cepstral coefficients (MFCCs) and i-vectors. However, there has been little work on multimodal diarization systems that combine d-vectors with additional sources of information. In this paper, we present a novel approach to multimodal speaker diarization that combines d-vectors with spatial information derived via beamforming with a multi-channel microphone array. Our system performs spectral clustering on a combination of speaker embeddings and spatial features computed using the Steered-Response Power Phase Transform (SRP-PHAT) algorithm. We evaluate our system on the AMI Meeting Corpus and an internal dataset of real-world conversations. By using both acoustic and spatial features for diarization, we achieve significant improvements over a d-vector-only baseline and show potential to achieve results comparable to other state-of-the-art multimodal diarization systems.
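The spatial features described above rest on phase-transform cross-correlation between microphone pairs. As a rough illustration of the idea (not the authors' implementation), the sketch below estimates a time difference of arrival (TDOA) with GCC-PHAT, the pairwise building block of SRP-PHAT, and then concatenates it with a speaker embedding to form a fused feature vector for clustering. The function names, the delay-grid handling, and the fusion weight are illustrative assumptions.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT.

    The phase transform whitens the cross-spectrum so only phase (i.e.
    delay) information remains, which is what makes SRP-PHAT robust to
    reverberation. Illustrative sketch, not the paper's implementation.
    """
    n = sig.shape[0] + ref.shape[0]
    X = np.fft.rfft(sig, n=n)
    Y = np.fft.rfft(ref, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12              # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-center so index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs
    return tau

def fuse_features(embedding, tau, spatial_weight=0.5):
    """Concatenate an L2-normalized speaker embedding with a scaled
    spatial (TDOA) feature; `spatial_weight` is a hypothetical knob
    balancing acoustic vs. spatial evidence before spectral clustering."""
    emb = np.asarray(embedding, dtype=float)
    emb = emb / (np.linalg.norm(emb) + 1e-12)
    return np.concatenate([emb, [spatial_weight * tau]])
```

In a full system, one fused vector per speech segment would be computed and an affinity matrix over those vectors passed to spectral clustering; the snippet only shows the per-segment feature construction.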
