Hypothesis Stitcher for End-to-End Speaker-Attributed ASR on Long-Form Multi-Talker Recordings
Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition, and speaker identification. The E2E SA-ASR model has shown significant improvement in speaker-attributed word error rate (SA-WER) for monaural overlapped speech consisting of various numbers of speakers. However, E2E models are known to suffer from degradation caused by mismatches between training and testing conditions. In particular, it has not yet been investigated whether the E2E SA-ASR model works well for very long recordings, i.e., recordings longer than those in the training data. In this paper, we first explore the E2E SA-ASR model on long-form multi-talker recordings, investigating a known long-form decoding algorithm for single-speaker ASR. We then propose a novel method, called hypothesis stitcher, which takes multiple hypotheses obtained from short audio segments and fuses them into a single hypothesis. We propose several model architecture variants for the hypothesis stitcher and evaluate them against conventional decoding methods. In our evaluation with the LibriSpeech and LibriCSS corpora, we show that the proposed method significantly improves SA-WER, especially for long-form multi-talker recordings.
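The abstract describes a segment-then-stitch pipeline: a long recording is split into short windows, each window is decoded by the E2E SA-ASR model, and the per-segment hypotheses are fused into a single speaker-attributed output. The sketch below only illustrates that data flow under stated assumptions; all names (segment_audio, decode_segment, stitch_hypotheses, SegmentHypothesis) are hypothetical placeholders, and the placeholder fusion step is a simple time-ordered merge, whereas the paper's hypothesis stitcher is a trained model.

```python
# Hypothetical sketch of the segment-then-stitch pipeline; not the authors' code.
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class SegmentHypothesis:
    speaker_id: str      # speaker label attributed by the SA-ASR model
    tokens: List[str]    # recognized word tokens for this segment
    start_time: float    # segment start time (seconds) in the long recording


def segment_audio(total_duration: float,
                  window: float = 30.0,
                  hop: float = 25.0) -> Iterator[Tuple[float, float]]:
    """Split a long recording into short, possibly overlapping windows."""
    t = 0.0
    while t < total_duration:
        yield t, min(t + window, total_duration)
        t += hop


def decode_segment(start: float, end: float) -> List[SegmentHypothesis]:
    """Placeholder for running the E2E SA-ASR model on one audio window."""
    # A real implementation would return speaker-attributed hypotheses here.
    return []


def stitch_hypotheses(per_segment: List[List[SegmentHypothesis]]) -> List[SegmentHypothesis]:
    """Placeholder fusion: merge hypotheses per speaker in time order.
    The paper instead trains a stitcher model to produce the fused hypothesis."""
    flat = [h for seg in per_segment for h in seg]
    return sorted(flat, key=lambda h: (h.speaker_id, h.start_time))


if __name__ == "__main__":
    # Example: a 10-minute recording decoded window by window, then fused.
    per_segment = [decode_segment(s, e) for s, e in segment_audio(600.0)]
    for hyp in stitch_hypotheses(per_segment):
        print(hyp.speaker_id, " ".join(hyp.tokens))
```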