DELAY-PENALIZED TRANSDUCER FOR LOW-LATENCY STREAMING ASR

Wei Kang (Xiaomi Corp., Beijing, China); Zengwei Yao (Xiaomi Corp.); Fangjun Kuang (Xiaomi Corp.); Liyong Guo (Xiaomi Corp.); Xiaoyu Yang (Xiaomi Corp.); Long Lin (Xiaomi Corp.); Piotr Żelasko (Johns Hopkins University); Daniel Povey (Johns Hopkins University)

08 Jun 2023

In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible with minimal impact on recognition accuracy. Although a few existing methods achieve this goal, they are difficult to implement because they depend on external alignments. In this paper, we propose a simple way to penalize symbol delay in the transducer model, so that the trade-off between symbol delay and accuracy can be balanced for streaming models without external alignments. Specifically, our method adds a small constant times (T/2 - t), where T is the number of frames and t is the current frame, to all the non-blank log-probabilities (after normalization) that are fed into the two-dimensional transducer recursion. For both streaming Conformer models and unidirectional long short-term memory (LSTM) models, experimental results show that the method significantly reduces symbol delay with an acceptable degradation in recognition performance. Our method achieves a delay-accuracy trade-off similar to that of the previously published FastEmit, but we believe it is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay. Our work is open-sourced and publicly available.
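
The penalty described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' released implementation. The function name `apply_delay_penalty`, the parameter names `lam` and `blank_id`, and the (T, U, V) array layout are assumptions made only for illustration. The sketch adds lam * (T/2 - t) to every non-blank log-probability and leaves the blank column untouched, as the abstract states.

```python
import numpy as np


def apply_delay_penalty(log_probs, lam, blank_id=0):
    """Return a copy of `log_probs` with the delay penalty applied.

    log_probs: float array of shape (T, U, V) holding normalized
        log-probabilities (frames x label positions x vocabulary).
        This layout is an assumption of the sketch.
    lam: the "small constant" from the abstract (hypothetical name).
    blank_id: index of the blank symbol in the vocabulary axis.
    """
    num_frames = log_probs.shape[0]
    t = np.arange(num_frames, dtype=log_probs.dtype)

    # Per-frame offset lam * (T/2 - t): positive for early frames,
    # negative for late frames.
    offset = lam * (num_frames / 2.0 - t)

    # Select every vocabulary entry except blank.
    non_blank = np.ones(log_probs.shape[-1], dtype=bool)
    non_blank[blank_id] = False

    penalized = log_probs.copy()
    # Broadcast the (T, 1, 1) offset over label positions and symbols.
    penalized[..., non_blank] += offset[:, None, None]
    return penalized


# Tiny example: 4 frames, 2 label positions, 3 symbols (index 0 is blank).
uniform = np.full((4, 2, 3), np.log(1.0 / 3.0))
penalized = apply_delay_penalty(uniform, lam=0.1)
```

With a positive `lam`, non-blank symbols emitted in early frames receive a higher score and those emitted in late frames a lower one, so the transducer recursion favors alignments with smaller symbol delay. Setting `lam` to 0 leaves the input to the recursion unchanged.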

Tags:

Acoustic modeling for automatic speech recognition

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
