Adapting a self-supervised speech representation for noisy speech emotion recognition by using contrastive teacher-student learning

Seong-Gyun Leem (University of Texas at Dallas); Daniel Fulford (Boston University); JP Onnela (T.H. Chan School of Public Health Harvard University); David Gard (San Francisco State University); Carlos Busso (University of Texas at Dallas)

06 Jun 2023

Studies have shown that fine-tuning a self-supervised speech representation model yields high performance in the speech emotion recognition (SER) task. Although such a model can provide emotionally discriminative embeddings in clean conditions, it still needs to be adapted to the noisy target environment when it is deployed in real-world applications. For adaptation, it is essential to balance acquiring new knowledge from noisy speech against keeping the knowledge acquired during the pre-training and fine-tuning of the model. Therefore, we propose a contrastive teacher-student learning framework to retrain a self-supervised speech representation model for noisy SER. To keep the knowledge of the original model, we minimize the root mean square error between the clean embeddings from the original SER model and the noisy embeddings from the retrained model. To acquire discriminative knowledge in the target noisy condition, we also minimize the InfoNCE loss, selecting the corresponding clean embedding as the positive sample and other noisy embeddings with different emotional labels as negative samples. Our experiments with the clean and noisy versions of the MSP-Podcast corpus demonstrate that, in the target noisy condition, the contrastive teacher-student learning framework significantly improves on the performance of a model trained only with clean speech, for all the emotional attributes.

Tags:

Robust speech recognition and adaptation

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
