ASR-AWARE END-TO-END NEURAL DIARIZATION

Aparna Khare, Eunjung Han, Yuguang Yang, Andreas Stolcke

Length: 00:14:20

12 May 2022

We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20% relative to the baseline.
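The multi-task objective and the "20% relative" figure mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes the diarization loss is a permutation-invariant binary cross-entropy over per-frame speaker activities (a common choice for EEND models), that the auxiliary loss is a per-frame cross-entropy over ASR-feature classes such as position-in-word, and that the two are combined with a scalar weight. The function names and the `aux_weight` parameter are hypothetical and do not come from the abstract.

```python
import math
from itertools import permutations


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def diarization_loss(probs, labels):
    """Permutation-invariant BCE, averaged over frames and speakers.

    probs[t][s]  : predicted activity probability of output s at frame t
    labels[t][s] : 0/1 reference activity of speaker s at frame t
    The loss is the minimum over all assignments of reference speakers
    to model outputs, so the order of the reference speakers is irrelevant.
    """
    n_frames, n_spk = len(probs), len(probs[0])
    best = float("inf")
    for perm in permutations(range(n_spk)):
        total = sum(
            bce(probs[t][s], labels[t][perm[s]])
            for t in range(n_frames)
            for s in range(n_spk)
        )
        best = min(best, total / (n_frames * n_spk))
    return best


def asr_feature_loss(class_probs, targets, eps=1e-7):
    """Frame-averaged cross-entropy over ASR-feature classes.

    class_probs[t][c] : predicted probability of class c at frame t
    targets[t]        : index of the reference class at frame t
    """
    n_frames = len(targets)
    return -sum(
        math.log(max(class_probs[t][targets[t]], eps)) for t in range(n_frames)
    ) / n_frames


def multitask_loss(probs, labels, class_probs, targets, aux_weight=0.1):
    """Diarization loss plus a weighted auxiliary loss (aux_weight is an assumed value)."""
    return diarization_loss(probs, labels) + aux_weight * asr_feature_loss(
        class_probs, targets
    )


def relative_der_reduction(baseline_der, new_der):
    """Relative reduction in DER, as a fraction of the baseline DER."""
    return (baseline_der - new_der) / baseline_der
```

Under these assumptions, a system that lowers DER from a hypothetical 10.0% to 8.0% achieves a relative reduction of (10.0 - 8.0) / 10.0, which is the sense in which a "20% relative" improvement is normally reported. These two DER values are illustrative only and are not results from the paper.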

Tags:

automatic speech recognition

multi-task learning

diarization

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
