Adaptive Knowledge Distillation between Text and Speech Pre-trained Models

Jinjie Ni (Nanyang Technological University); Yukun Ma (Alibaba Group); Wen Wang (Alibaba Group); Qian Chen (Speech Lab, DAMO Academy, Alibaba Group); Dianwen Ng (Alibaba Group/Nanyang Technological University); Han Lei (Nanyang Technological University); Trung Hieu Nguyen (Alibaba Group); Chong Zhang (Alibaba Group); Bin Ma (Alibaba, Singapore R&D Center); Erik Cambria (Nanyang Technological University, Singapore)

06 Jun 2023

Learning from massive speech corpora has driven the recent success of many self-supervised speech models. With knowledge distillation, these models may also benefit from the knowledge encoded by language models pre-trained on rich text sources. The distillation process, however, is challenging because of the modal disparity between the textual and speech embedding spaces. This paper studies metric-based distillation that aligns the embedding spaces of text and speech using only a small amount of data and without modifying the model structure. Because the semantic and granularity gap between text and speech has been overlooked in the literature, which impairs distillation, we propose Prior-informed Adaptive knowledge Distillation (PAD), which adaptively leverages text/speech units of variable granularity and prior distributions to achieve better global and local alignment between text and speech pre-trained models. We evaluate on three spoken language understanding benchmarks and show that PAD transfers linguistic knowledge more effectively than other metric-based distillation approaches.
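
To make the idea of metric-based distillation concrete, here is a minimal sketch of a generic global alignment loss: the text (teacher) and speech (student) sequences are each mean-pooled into a single vector, and the loss is the cosine distance between the two. This is not the PAD method described in the abstract; the function names, the choice of mean pooling, and the choice of cosine distance are all assumptions made for illustration, and the toy vectors are invented.

```python
import math

# Hypothetical helper names; a generic metric-based baseline, not PAD itself.

def mean_pool(vectors):
    """Average a sequence of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def global_distillation_loss(text_tokens, speech_frames):
    """Distance between pooled teacher (text) and student (speech) embeddings.

    The two sequences may have different lengths (tokens vs. frames);
    pooling removes the length mismatch before the metric is applied.
    """
    return cosine_distance(mean_pool(text_tokens), mean_pool(speech_frames))

# Toy embeddings (invented): 2 text tokens and 3 speech frames, dimension 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
speech_frames = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
loss = global_distillation_loss(text_tokens, speech_frames)
```

In a training setup, a loss of this kind would be minimized with respect to the speech model's parameters while the text model is kept fixed. Pooling over the whole utterance only captures global alignment; the abstract indicates that PAD additionally addresses local alignment over units of variable granularity, which this sketch does not attempt.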

Tags:

Speech emotion detection and analysis

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
