Fast Yet Effective Speech Emotion Recognition with Self-Distillation

Zhao Ren (L3S Research Center); Thanh Tam Nguyen (Griffith University); Yi Chang (Imperial College London); Bjoern W. Schuller (Imperial College London)

07 Jun 2023

Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER plays an important role in helping dialogue systems truly understand our emotions and become trustworthy conversational partners. Owing to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that achieving fast yet effective SER is possible with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) applying self-distillation to the acoustic modality overcomes the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing powerful models at different depths can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to faster learning and state-of-the-art accuracy on data with small quantities of label information.
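
The abstract does not spell out the training objective, so the following is only a minimal sketch of one common self-distillation formulation, not the authors' implementation. It assumes the network exposes classifier outputs ("exits") at several depths, that the deepest exit acts as the teacher, and that each shallower exit is trained with a mix of hard-label cross-entropy and a KL term towards the teacher's softened output. The function names and the parameters `alpha` and `temperature` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a hard label index."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(exit_logits, label, alpha=0.5, temperature=2.0):
    """Combine losses over all exits, ordered from shallowest to deepest.

    Assumption: the last entry is the full (deepest) model and serves as
    the teacher for every shallower exit.
    """
    teacher = softmax(exit_logits[-1], temperature)
    # The deepest exit is trained on the ground-truth label only.
    loss = cross_entropy(exit_logits[-1], label)
    for logits in exit_logits[:-1]:
        student = softmax(logits, temperature)
        hard = cross_entropy(logits, label)      # supervision from the label
        soft = kl_divergence(teacher, student)   # supervision from the teacher
        loss += (1 - alpha) * hard + alpha * soft
    return loss


# Toy example: three exits scoring four hypothetical emotion classes.
exits = [[0.2, 1.1, 0.3, -0.5], [0.1, 1.8, 0.2, -0.9], [0.0, 2.5, 0.1, -1.2]]
print(self_distillation_loss(exits, label=1))
```

At inference time, a device with a tight compute budget could stop at a shallower exit and a less constrained one could run the full depth, which is the accuracy-efficiency trade-off the abstract refers to.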

Tags:

Speech analysis and Language disorder Analysis

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

A Generalized Subspace Distribution Adaptation Framework for Cross-Corpus Speech Emotion Recognition

Leveraging Pretrained Representations with Task-related Keywords for Alzheimer's Disease Detection

Wav2vec-based Detection and Severity Level Classification of Dysarthria from Speech
