Towards Accurate and Real-time End-of-speech Estimation

Yifeng Fan (University of Illinois at Urbana-Champaign); Colin Vaz (Amazon); Di He (Amazon); Jahn Heymann (Amazon); Viet Anh Trinh (Amazon); Zhe Zhang (Amazon); Venkatesh Ravichandran (Amazon)

06 Jun 2023

We introduce a variant of the endpoint (EP) detection problem in automatic speech recognition (ASR), which we call end-of-speech (EOS) estimation. Given an utterance, EOS estimation aims to identify the timestamp at which the utterance waveform has fully decayed; this timestamp is then used to measure EP latency. Accurate EOS estimation is difficult in large-scale streaming audio scenarios because of heavy traffic and hardware limitations. To this end, we develop an efficient and accurate framework that performs forced alignment on the 1-best ASR hypothesis. In particular, we propose aligning on binarized state sequences, which yields an EOS estimate that is robust to the ASR hypothesis and reduces the estimation error by 28% compared to aligning on phoneme states. We observe a further 30% error reduction when intermediate-stage embeddings of the encoder are used as additional features for computing the binary probabilities.
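The idea of aligning on a binarized state sequence can be illustrated with a small sketch. This is not the authors' implementation: it assumes that per-frame speech probabilities are already available from some model, that the silence label is "sil", and that the frame shift is 10 ms. The function names (binarize, forced_align, estimate_eos) are hypothetical. Each phone in the 1-best hypothesis is mapped to speech (1) or non-speech (0), adjacent repeats are merged, and a Viterbi pass over a left-to-right chain assigns every frame to one state. The EOS is taken as the end of the last frame aligned to a speech state.

```python
import math


def binarize(phones, silence=("sil",)):
    """Map phone labels to 1 (speech) or 0 (non-speech), merging adjacent repeats.

    The silence label set is an assumption made for this sketch.
    """
    states = []
    for p in phones:
        b = 0 if p in silence else 1
        if not states or states[-1] != b:
            states.append(b)
    return states


def forced_align(speech_prob, states):
    """Viterbi alignment over a left-to-right chain of binary states.

    Each frame either stays in the current state or advances by one.
    Returns the state index assigned to every frame.
    """
    T, S = len(speech_prob), len(states)
    if T < S:
        raise ValueError("fewer frames than states; no alignment exists")
    neg_inf = float("-inf")

    def emit(t, s):
        # Log-probability of frame t under a speech or non-speech state.
        p = speech_prob[t] if states[s] == 1 else 1.0 - speech_prob[t]
        return math.log(max(p, 1e-10))

    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = emit(0, 0)  # the alignment must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + emit(t, s)

    # Backtrace from the final state, which the alignment must end in.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def estimate_eos(speech_prob, phones, frame_shift_s=0.01):
    """Return the end time (seconds) of the last frame aligned to speech."""
    states = binarize(phones)
    path = forced_align(speech_prob, states)
    speech_frames = [t for t, s in enumerate(path) if states[s] == 1]
    if not speech_frames:
        raise ValueError("hypothesis contains no speech states")
    return (speech_frames[-1] + 1) * frame_shift_s


# Toy input: seven frames of made-up speech probabilities.
probs = [0.1, 0.1, 0.9, 0.8, 0.9, 0.2, 0.1]
hyp = ["sil", "h", "e", "l", "o", "sil"]
eos_seconds = estimate_eos(probs, hyp)
```

In this toy input the hypothesis collapses to non-speech, speech, non-speech, and frames 2 to 4 align to the speech state, so the estimated EOS is about 0.05 s. Because only the speech/non-speech pattern of the hypothesis enters the alignment, a substituted phone leaves the state sequence unchanged, which is one plausible reading of why a binarized alignment is less sensitive to recognition errors than a phoneme-level one.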

Tags:

Word spotting, VAD, and other topics in speech recognition

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle