Towards Accurate and Real-time End-of-speech Estimation
Yifeng Fan (University of Illinois at Urbana-Champaign); Colin Vaz (Amazon); Di He (Amazon); Jahn Heymann (Amazon); Viet Anh Trinh (Amazon); Zhe Zhang (Amazon); Venkatesh Ravichandran (Amazon)
We introduce a variant of the endpoint (EP) detection problem in automatic speech recognition (ASR), which we call end-of-speech (EOS) estimation. Given an utterance, EOS estimation aims to identify the timestamp at which the utterance waveform has fully decayed; this timestamp is then used to measure EP latency. Accurate EOS estimation is difficult in large-scale streaming audio scenarios because of heavy traffic and hardware limitations. To this end, we develop an efficient and accurate framework that performs forced alignment on the 1-best ASR hypothesis. In particular, we propose aligning on binarized state sequences, which yields an EOS estimate that is robust to the ASR hypothesis and reduces the estimation error by 28% compared to aligning on phoneme states. In addition, we observe a further 30% error reduction when the intermediate-stage embeddings of the encoder are used as additional features to compute the binary probabilities.
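To make the idea of aligning on a binarized state sequence concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each frame comes with a log-probability for two classes (non-speech = 0, speech = 1), that the 1-best hypothesis is available as a phoneme list with a hypothetical silence token `sil`, and that frames are 10 ms apart. The function names `binarize`, `forced_align`, and `estimate_eos` are illustrative only.

```python
import math


def binarize(phonemes, silence="sil"):
    """Map phoneme labels to 1 (speech) or 0 (non-speech), merging repeats.

    The silence token name is an assumption made for this sketch.
    """
    states = []
    for p in phonemes:
        b = 0 if p == silence else 1
        if not states or states[-1] != b:
            states.append(b)
    return states


def forced_align(log_probs, states):
    """Viterbi forced alignment of frames to a left-to-right state sequence.

    log_probs[t][c] is the log-probability of class c (0 or 1) at frame t.
    Each frame either stays in the current state or advances by one state.
    Returns the binary label assigned to every frame.
    """
    T, S = len(log_probs), len(states)
    if S == 0 or T < S:
        raise ValueError("need at least one frame per state")
    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][states[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + log_probs[t][states[s]]
    # Backtrace from the final state at the final frame.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [states[s] for s in path]


def estimate_eos(frame_labels, frame_shift=0.01):
    """Return the end time (seconds) of the last speech frame, or None."""
    for t in range(len(frame_labels) - 1, -1, -1):
        if frame_labels[t] == 1:
            return (t + 1) * frame_shift
    return None


# Toy input: per-frame speech probabilities (made up for illustration).
speech_prob = [0.1, 0.9, 0.8, 0.9, 0.2, 0.1]
log_probs = [[math.log(1.0 - p), math.log(p)] for p in speech_prob]
states = binarize(["sil", "h", "ai", "sil"])  # collapses to [0, 1, 0]
labels = forced_align(log_probs, states)
eos_seconds = estimate_eos(labels)
```

Because all phonemes collapse into a single speech class, a substituted or misrecognized phoneme in the hypothesis leaves the binary target sequence unchanged in this sketch, which is one plausible reading of why the binarized alignment is less sensitive to hypothesis errors than a phoneme-level one.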