Hypothesis Stitcher for End-to-End Speaker-Attributed ASR on Long-Form Multi-Talker Recordings
Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition, and speaker identification. The E2E SA-ASR model has shown significant improvement in speaker-attributed word error rate (SA-WER) for monaural overlapped speech consisting of various numbers of speakers. However, E2E models are known to suffer from degradation caused by mismatches between training and testing conditions. In particular, it has not yet been investigated whether the E2E SA-ASR model works well for very long recordings, i.e., recordings longer than those in the training data. In this paper, we first explore the E2E SA-ASR model on long-form multi-talker recordings, investigating a known long-form decoding algorithm for single-speaker ASR. We then propose a novel method, called hypothesis stitcher, which takes multiple hypotheses obtained from short audio segments and fuses them into a single hypothesis. We propose several model architecture variants for the hypothesis stitcher and evaluate them against conventional decoding methods. In our evaluation with the LibriSpeech and LibriCSS corpora, we show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
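The abstract describes a segment-then-stitch pipeline: a long recording is split into short windows, each window is decoded by the E2E SA-ASR model, and the per-segment hypotheses are fused into a single speaker-attributed output. The sketch below only illustrates that data flow under stated assumptions; all names (segment_audio, decode_segment, stitch_hypotheses, SegmentHypothesis) are hypothetical placeholders, and the placeholder fusion step is a simple time-ordered merge, whereas the paper's hypothesis stitcher is a trained model.

```python
# Hypothetical sketch of the segment-then-stitch pipeline; not the authors' code.
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class SegmentHypothesis:
    speaker_id: str      # speaker label attributed by the SA-ASR model
    tokens: List[str]    # recognized word tokens for this segment
    start_time: float    # segment start time (seconds) in the long recording


def segment_audio(total_duration: float,
                  window: float = 30.0,
                  hop: float = 25.0) -> Iterator[Tuple[float, float]]:
    """Split a long recording into short, possibly overlapping windows."""
    t = 0.0
    while t < total_duration:
        yield t, min(t + window, total_duration)
        t += hop


def decode_segment(start: float, end: float) -> List[SegmentHypothesis]:
    """Placeholder for running the E2E SA-ASR model on one audio window."""
    # A real implementation would return speaker-attributed hypotheses here.
    return []


def stitch_hypotheses(per_segment: List[List[SegmentHypothesis]]) -> List[SegmentHypothesis]:
    """Placeholder fusion: merge hypotheses per speaker in time order.
    The paper instead trains a stitcher model to produce the fused hypothesis."""
    flat = [h for seg in per_segment for h in seg]
    return sorted(flat, key=lambda h: (h.speaker_id, h.start_time))


if __name__ == "__main__":
    # Example: a 10-minute recording decoded window by window, then fused.
    per_segment = [decode_segment(s, e) for s, e in segment_audio(600.0)]
    for hyp in stitch_hypotheses(per_segment):
        print(hyp.speaker_id, " ".join(hyp.tokens))
```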