Reducing the gap between streaming and non-streaming Transducer-based ASR models by adaptive two-stage knowledge distillation

Haitao Tang (iFlytek Research); Yu Fu (Zhejiang University); Lei Sun (iFlytek Research); Jiabin Xue (Harbin Institute of Technology); Dan Liu (iFLYTEK Co., Ltd.); Yongchao Li (iFlytek Research); Zhiqiang Ma (iFlytek Research); Minghui Wu (iFlytek Research); Jia Pan (iFlytek Research); Genshun Wan (iFlytek Research); Ming'en Zhao (iFlytek Research)

06 Jun 2023

The Transducer is one of the mainstream frameworks for streaming speech recognition. A performance gap exists between streaming and non-streaming transducer models because the streaming model has access to only limited context. An effective way to reduce this gap is to make their hidden and output distributions consistent, which can be achieved through hierarchical knowledge distillation. However, it is difficult to enforce consistency of both distributions simultaneously, since learning the output distribution depends on the hidden one. In this paper, we propose an adaptive two-stage knowledge distillation method. In the first stage, we learn hidden representations with full context by applying a mean squared error loss. In the second stage, we design a power-transformation-based adaptive smoothness method to learn a stable output distribution. The method achieves a 19% relative reduction in word error rate and a faster first-token response compared with the original streaming model on the LibriSpeech corpus.
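As a rough illustration of the two stages described in the abstract, the sketch below shows what the two distillation losses could look like in PyTorch. The function names, tensor shapes, and the smoothing exponent `gamma` are hypothetical, and the paper's actual loss formulation and adaptive smoothing schedule may differ; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stage1_hidden_mse(streaming_hidden: torch.Tensor,
                      full_context_hidden: torch.Tensor) -> torch.Tensor:
    """Stage 1: pull the streaming encoder's hidden representation toward
    the non-streaming (full-context) teacher's via mean squared error.
    The teacher is detached so gradients only update the student."""
    return F.mse_loss(streaming_hidden, full_context_hidden.detach())

def stage2_output_kd(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     gamma: float = 0.5) -> torch.Tensor:
    """Stage 2: distill the output distribution after a power transformation
    that smooths the teacher's (often over-confident) distribution.
    `gamma` is a hypothetical smoothing exponent; the paper adapts it
    rather than fixing it."""
    teacher_probs = F.softmax(teacher_logits.detach(), dim=-1)
    # Power transformation followed by renormalization: gamma < 1 flattens
    # (smooths) the distribution, gamma = 1 leaves it unchanged.
    smoothed = teacher_probs.pow(gamma)
    smoothed = smoothed / smoothed.sum(dim=-1, keepdim=True)
    log_student = F.log_softmax(student_logits, dim=-1)
    # KL divergence between the smoothed teacher and student distributions.
    return F.kl_div(log_student, smoothed, reduction="batchmean")
```

In this reading, stage 1 aligns internal representations first so that the stage 2 output-level distillation starts from a consistent hidden space, which is the ordering difficulty the abstract points out.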
