PREDICTING MULTI-CODEBOOK VECTOR QUANTIZATION INDEXES FOR KNOWLEDGE DISTILLATION
Liyong Guo (Northwestern Polytechnical University); Xiaoyu Yang (Xiaomi Corp., Beijing); Quandong Wang (Xiaomi Corp., Beijing); Yuxiang Kong (Xiaomi Corp., Beijing); Zengwei Yao (Xiaomi Corp., Beijing); Fan Cui (Xiaomi Corp., Beijing); Fangjun Kuang (Xiaomi Corp., Beijing); Wei Kang (Xiaomi Corp., Beijing); Long Lin (Xiaomi Corp., Beijing); Mingshuang Luo (Xiaomi Corp., Beijing); Piotr Żelasko (Johns Hopkins University); Daniel Povey (Johns Hopkins University)
Knowledge distillation (KD) is a common approach to improving
model performance in automatic speech recognition (ASR), where a
student model is trained to imitate the output behaviour of a teacher
model. However, traditional KD methods suffer from a teacher label
storage problem, especially when the training corpora are large.
Although on-the-fly teacher label generation tackles this issue, the
training speed is significantly slower because the teacher model has to
be evaluated for every batch. In this paper, we reformulate the generation
of teacher labels as a codec problem. We propose a novel
Multi-codebook Vector Quantization (MVQ) approach that compresses
teacher embeddings to codebook indexes (CI). Based on
this, a KD training framework (MVQ-KD) is proposed in which a
student model predicts the CI generated from the embeddings of a
self-supervised pre-trained teacher model. Experiments on the LibriSpeech
clean-100 hour subset show that the MVQ-KD framework achieves
performance comparable to traditional KD methods (l1, l2), while
requiring 256 times less storage. When the full LibriSpeech dataset
is used, the MVQ-KD framework yields 13.8% and 8.2% relative
word error rate reductions (WERRs) on test-clean and test-other for
the non-streaming transducer, and 4.0% and 4.9% for the streaming
transducer. The implementation of this work has been released as part
of the open-source project icefall.
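
To make the quantize-then-predict idea concrete, the following is a minimal PyTorch sketch of one way such a scheme could look. It is an illustration only, not the icefall implementation: the names (MultiCodebookQuantizer, mvq_kd_loss, prediction_head) are hypothetical, the codebooks here are randomly initialized rather than trained as in the paper, and the encoding is plain nearest-centroid lookup over sub-vectors.

import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    """Maps teacher embeddings (B, T, D) to N codebook indexes per frame."""

    def __init__(self, embed_dim: int, num_codebooks: int = 8, codebook_size: int = 256):
        super().__init__()
        assert embed_dim % num_codebooks == 0
        self.num_codebooks = num_codebooks
        # One codebook of `codebook_size` centroids per sub-vector (randomly
        # initialized here for illustration; the paper's quantizer is trained).
        self.codebooks = nn.Parameter(
            torch.randn(num_codebooks, codebook_size, embed_dim // num_codebooks)
        )

    @torch.no_grad()
    def encode(self, teacher_embed: torch.Tensor) -> torch.Tensor:
        B, T, D = teacher_embed.shape
        # Split each frame's embedding into one sub-vector per codebook: (B, T, N, d).
        sub = teacher_embed.reshape(B, T, self.num_codebooks, -1)
        # Squared distance to every centroid: (B, T, N, codebook_size).
        dists = ((sub.unsqueeze(3) - self.codebooks.unsqueeze(0).unsqueeze(0)) ** 2).sum(-1)
        # Nearest centroid per sub-vector: (B, T, N) integer codebook indexes (CI).
        return dists.argmin(dim=-1)


def mvq_kd_loss(student_embed, prediction_head, indexes):
    """Auxiliary KD loss: cross-entropy between predicted and teacher codebook indexes.

    prediction_head is assumed to be nn.Linear(student_dim, num_codebooks * codebook_size).
    """
    B, T, N = indexes.shape
    logits = prediction_head(student_embed).reshape(B, T, N, -1)
    return nn.functional.cross_entropy(logits.reshape(B * T * N, -1), indexes.reshape(-1))

In a setup like this, each teacher frame is stored as just N small integer indexes rather than a full floating-point embedding, which is the source of the storage saving cited above; during student training, an auxiliary loss of this form would typically be added to the transducer loss with a scaling factor.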