Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization
Dongmei Wang (Microsoft); Xiong Xiao (Microsoft); Naoyuki Kanda (Microsoft); Takuya Yoshioka (Microsoft); Jian Wu (Microsoft)
This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied along the speaker axis to make the model output insensitive to the order of the speaker profiles provided to the TS-VAD model. Time-wise sequential layers are interspersed between these speaker-wise transformer layers to allow the temporal and cross-speaker correlations of the input speech signal to be captured. We also extend a diarization model based on end-to-end neural diarization with encoder-decoder-based attractors (EEND-EDA) by replacing its dot-product-based speaker detection layer with the transformer-based TS-VAD. Experimental results on VoxConverse show that using transformers for cross-speaker modeling reduces the diarization error rate (DER) of TS-VAD by 11.3%, achieving a new state-of-the-art (SOTA) DER of 4.57%. Our extended EEND-EDA also reduces DER by 6.9% on the CALLHOME dataset relative to the original EEND-EDA with a similar model size, achieving a new SOTA DER of 11.18% under a widely used training data setting.
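
To make the cross-speaker modeling described above concrete, the following is a minimal sketch (not the authors' implementation) of interleaving speaker-axis transformer layers with time-wise layers over an input tensor whose speaker and time dimensions can vary. The PyTorch modules, layer sizes, and the choice of transformer layers on both axes are illustrative assumptions.

    # Minimal sketch, assuming an input tensor of shape
    # (batch, num_speakers, num_frames, feat_dim); all hyperparameters are illustrative.
    import torch
    import torch.nn as nn


    class CrossSpeakerBlock(nn.Module):
        """One block: attention over speakers per frame, then over frames per speaker."""

        def __init__(self, feat_dim: int = 256, num_heads: int = 4):
            super().__init__()
            # Applied along the speaker axis; no positional encoding over speakers,
            # so the block is equivariant to the order of the speaker profiles.
            self.speaker_layer = nn.TransformerEncoderLayer(
                d_model=feat_dim, nhead=num_heads, batch_first=True)
            # Applied along the time axis to capture temporal context per speaker.
            self.time_layer = nn.TransformerEncoderLayer(
                d_model=feat_dim, nhead=num_heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, s, t, d = x.shape
            # Speaker-wise attention: each frame is a sequence of S speaker embeddings.
            x = x.permute(0, 2, 1, 3).reshape(b * t, s, d)
            x = self.speaker_layer(x)
            # Time-wise attention: each speaker is a sequence of T frames.
            x = x.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
            x = self.time_layer(x)
            return x.reshape(b, s, t, d)


    class TransformerTSVAD(nn.Module):
        """Stack of cross-speaker blocks followed by a per-speaker, per-frame VAD head."""

        def __init__(self, feat_dim: int = 256, num_blocks: int = 3):
            super().__init__()
            self.blocks = nn.ModuleList(
                CrossSpeakerBlock(feat_dim) for _ in range(num_blocks))
            self.vad_head = nn.Linear(feat_dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            for block in self.blocks:
                x = block(x)
            # Speech activity probability for every (speaker, frame) pair.
            return torch.sigmoid(self.vad_head(x)).squeeze(-1)


    # Works for any number of speakers or frames, e.g. 3 speakers, 200 frames.
    model = TransformerTSVAD()
    scores = model(torch.randn(2, 3, 200, 256))  # -> shape (2, 3, 200)

Because the speaker-axis attention carries no positional information over speakers, permuting the speaker profiles simply permutes the corresponding output rows, which is the order-insensitivity property referred to in the abstract.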