E2E Segmentation in a Two-Pass Cascaded Encoder ASR Model

W. Ronny Huang (Google); Shuo-yiin Chang (Google); Tara Sainath (Google); Yanzhang He (Google); David Rybach (Google); Robert David (Google); Rohit Prabhavalkar (Google); Cyril Allauzen (Google); Charles C Peyser (Google Inc.); Trevor Strohman (Google)


06 Jun 2023

We explore unifying a neural segmenter with a two-pass cascaded encoder ASR model into a single model. A key challenge is allowing the segmenter (which runs in real time, synchronously with the decoder) to finalize the non-causal 2nd pass (which runs 900 ms behind real time) without introducing user-perceived latency or deletion errors during inference. We propose a design in which the neural segmenter is integrated with the causal 1st-pass decoder to emit an end-of-segment (EOS) signal in real time. The EOS signal is then used to finalize the non-causal 2nd pass. We experiment with different ways to finalize the 2nd pass and find that a dummy frame injection strategy yields both high-quality 2nd-pass results and low finalization latency. On a real-world long-form captioning task (YouTube), we achieve a 2.4% relative WER reduction and a 140 ms reduction in EOS latency over a baseline VAD-based segmenter with the same cascaded encoder.
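The idea behind dummy frame injection can be illustrated with a toy sketch. This is not the authors' implementation: `LookaheadSecondPass`, `finalize_with_dummy_frames`, the zero-valued dummy frame, and the 15-frame lookahead (900 ms at an assumed 60 ms per frame) are all hypothetical choices made for illustration. The sketch models a non-causal pass that can only emit the output for frame i once it has seen a fixed number of future frames, and shows how padding with dummy frames at EOS lets it emit every remaining real frame immediately instead of waiting for more audio.

```python
from typing import List


class LookaheadSecondPass:
    """Toy stand-in for a non-causal 2nd pass (hypothetical, for illustration).

    The output for frame i becomes available only after frames
    i+1 .. i+right_context have also been received.
    """

    def __init__(self, right_context: int) -> None:
        self.right_context = right_context
        self.buffer: List[float] = []  # all frames received so far
        self.emitted = 0               # number of frames already emitted

    def push(self, frame: float) -> List[int]:
        """Add one frame and return the indices of frames that are now ready."""
        self.buffer.append(frame)
        ready: List[int] = []
        # Frame i is ready once the buffer extends right_context frames past it.
        while self.emitted + self.right_context < len(self.buffer):
            ready.append(self.emitted)
            self.emitted += 1
        return ready


def finalize_with_dummy_frames(second_pass: LookaheadSecondPass,
                               dummy: float = 0.0) -> List[int]:
    """On EOS, inject right_context dummy frames to flush the pending real frames.

    The dummy value (0.0) is an assumption; only real frames are ever emitted,
    because the dummies merely supply the missing right context.
    """
    flushed: List[int] = []
    for _ in range(second_pass.right_context):
        flushed.extend(second_pass.push(dummy))
    return flushed


def run_segment(num_real_frames: int, right_context: int) -> List[List[int]]:
    """Stream a segment, then finalize at EOS; returns [streamed, flushed]."""
    second_pass = LookaheadSecondPass(right_context)
    streamed: List[int] = []
    for t in range(num_real_frames):
        streamed.extend(second_pass.push(float(t)))
    flushed = finalize_with_dummy_frames(second_pass)
    return [streamed, flushed]
```

With `run_segment(20, 15)`, the toy 2nd pass has emitted only frames 0 to 4 when EOS arrives after 20 real frames; injecting 15 dummy frames then releases frames 5 to 19 at once, so no real frame is dropped and no further audio is needed before finalizing.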

Tags:

Word spotting, VAD, and other topics in speech recognition


Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
