
Compressing Transformer-based ASR Model by Task-driven Loss and Attention-based Multi-level Feature Distillation

Yongjie Lv, Longbiao Wang, Meng Ge, Kiyoshi Honda, Sheng Li, Chenchen Ding, Lixin Pan, Yuguang Wang, Jianwu Dang

Length: 00:10:53
12 May 2022

Current knowledge distillation (KD) methods effectively compress transformer-based end-to-end speech recognition models. However, existing methods fail to exploit the complete information of the teacher model, distilling only a limited number of its blocks. In this study, we first integrate a task-driven loss function into the decoder's intermediate blocks to generate task-related feature representations. We then propose an attention-based multi-level feature distillation scheme that automatically learns a feature representation summarized over all blocks of the teacher model. With a 1.1M-parameter student model, experiments on the Wall Street Journal dataset show that our approach achieves a 12.1% WER reduction compared with the baseline system.
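The abstract does not give implementation details. As an illustration only, the following PyTorch sketch shows one way an attention-based multi-level feature distillation loss could be realized: a learnable query attends over the outputs of all teacher blocks, and a projected student feature is matched to the resulting weighted summary. The class name, shapes, and the L2 matching loss are assumptions for the sketch, not the authors' code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionMultiLevelDistillLoss(nn.Module):
        """Hypothetical sketch: match a student feature to an
        attention-weighted summary of ALL teacher block outputs."""

        def __init__(self, student_dim, teacher_dim):
            super().__init__()
            # project student features into the teacher feature space
            self.proj = nn.Linear(student_dim, teacher_dim)
            # learnable query used to score each teacher block
            self.query = nn.Parameter(torch.randn(teacher_dim))

        def forward(self, student_feat, teacher_feats):
            # student_feat:  (batch, time, student_dim), one student block output
            # teacher_feats: list of (batch, time, teacher_dim), one per teacher block
            s = self.proj(student_feat)                              # (B, T, D)
            t = torch.stack(teacher_feats, dim=2)                    # (B, T, N, D)
            # scaled dot-product scores of the query against each teacher block
            scores = (t * self.query).sum(-1) / t.size(-1) ** 0.5    # (B, T, N)
            weights = F.softmax(scores, dim=-1)                      # (B, T, N)
            # attention-weighted summary over all teacher blocks
            summary = (weights.unsqueeze(-1) * t).sum(dim=2)         # (B, T, D)
            # feature-level distillation loss (L2 here; the paper's choice may differ)
            return F.mse_loss(s, summary)

In practice such a loss would be added, with some weighting, to the task-driven (e.g. cross-entropy/CTC) objective of the student; the exact combination used in the paper is not specified in the abstract.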
