Ensemble knowledge distillation of self-supervised speech models

Kuan-Po Huang (National Taiwan University); Tzu-hsun Feng (National Taiwan University); Yu-Kuan Fu (National Taiwan University); Tsu-Yuan Hsu (National Taiwan University); Po-Chieh Yen (National Taiwan University); Wei-Cheng Tseng (National Taiwan University); Kai-Wei Chang (National Taiwan University); Hung-yi Lee (National Taiwan University)

07 Jun 2023

Distilled self-supervised models have shown competitive performance and efficiency in recent years. However, little work has explored jointly distilling multiple self-supervised speech models. In this work, we perform Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM. We apply two aggregation techniques, layerwise-average and layerwise-concatenation, to the representations of the different teacher models and find that the former is more effective. In addition, we propose a multiple-prediction-head method in which the student model predicts different layer outputs of multiple teacher models simultaneously. Experimental results show that our method improves the performance of the distilled models on four downstream speech processing tasks (Phoneme Recognition, Speaker Identification, Emotion Recognition, and Automatic Speech Recognition) in the hidden-set track of the SUPERB benchmark.
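
The two aggregation techniques and the multiple-prediction-head idea named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the tensor shapes, the random stand-in "teacher" features, the linear heads, the L1 loss, and every function name below are assumptions chosen only to make the shapes concrete.

```python
import numpy as np

def layerwise_average(teacher_reps):
    """Average same-index layers across teachers.

    teacher_reps: list of arrays, each shaped (num_layers, frames, dim).
    Returns one target per layer, shaped (num_layers, frames, dim).
    """
    return np.mean(np.stack(teacher_reps, axis=0), axis=0)

def layerwise_concat(teacher_reps):
    """Concatenate same-index layers across teachers on the feature axis.

    Returns targets shaped (num_layers, frames, num_teachers * dim).
    """
    return np.concatenate(teacher_reps, axis=-1)

def multi_head_predict(student_hidden, heads):
    """Apply one linear head per (teacher, layer) pair to the student output.

    student_hidden: (frames, student_dim)
    heads: (num_teachers, num_layers, student_dim, teacher_dim), hypothetical
    Returns predictions shaped (num_teachers, num_layers, frames, teacher_dim).
    """
    return np.einsum("fs,tlsd->tlfd", student_hidden, heads)

def l1_loss(pred, target):
    # Assumed loss for illustration; the abstract does not name the objective.
    return float(np.mean(np.abs(pred - target)))

# Random stand-ins for three teachers (e.g. HuBERT, RobustHuBERT, WavLM).
rng = np.random.default_rng(0)
num_teachers, num_layers, frames, teacher_dim, student_dim = 3, 2, 5, 8, 4
teacher_reps = [
    rng.standard_normal((num_layers, frames, teacher_dim))
    for _ in range(num_teachers)
]

avg_target = layerwise_average(teacher_reps)  # shape (2, 5, 8)
cat_target = layerwise_concat(teacher_reps)   # shape (2, 5, 24)

student_hidden = rng.standard_normal((frames, student_dim))
heads = 0.1 * rng.standard_normal(
    (num_teachers, num_layers, student_dim, teacher_dim)
)
preds = multi_head_predict(student_hidden, heads)  # shape (3, 2, 5, 8)
loss = l1_loss(preds, np.stack(teacher_reps, axis=0))
```

Under these assumptions, averaging keeps the target dimension fixed as teachers are added, concatenation grows it linearly with the number of teachers, and the multi-head variant keeps each teacher's layer outputs as separate targets.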

Tags:

Resource constrained speech recognition

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

Improving Accented Speech Recognition with Multi-Domain Training

Papez: Resource-efficient Speech Separation with Auditory Working Memory

Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language
