Adapting a self-supervised speech representation for noisy speech emotion recognition by using contrastive teacher-student learning
Seong-Gyun Leem (University of Texas at Dallas); Daniel Fulford (Boston University); JP Onnela (Harvard T.H. Chan School of Public Health); David Gard (San Francisco State University); Carlos Busso (University of Texas at Dallas)
Studies have shown that fine-tuning a self-supervised speech representation model yields high performance on the speech emotion recognition (SER) task. Although such a model can provide emotionally discriminative embeddings under clean conditions, it still needs to be adapted to the noisy target environment when deployed in real-world applications. For adaptation, it is essential to balance acquiring new knowledge from noisy speech with retaining the knowledge acquired during the pre-training and fine-tuning of the model. Therefore, we propose a contrastive teacher-student learning framework to retrain a self-supervised speech representation model for noisy SER. To retain the knowledge of the original model, we minimize the root mean square error (RMSE) between the clean embeddings produced by the original SER model and the noisy embeddings produced by the retrained model. To acquire discriminative knowledge in the target noisy condition, we also minimize the InfoNCE loss, selecting the corresponding clean embedding as the positive sample and noisy embeddings with different emotional labels as negative samples. Our experiments with clean and noisy versions of the MSP-Podcast corpus demonstrate that the contrastive teacher-student learning framework significantly improves performance in the target noisy condition, for all emotional attributes, over a model trained only on clean speech.
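The two objectives described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the function names, the temperature value, the batch layout, and the use of discrete emotion labels to select negatives are illustrative assumptions.

import torch
import torch.nn.functional as F

def retention_loss(clean_emb, noisy_emb):
    # RMSE between the frozen teacher's clean embeddings and the
    # retrained student's noisy embeddings (knowledge-retention term).
    return torch.sqrt(F.mse_loss(noisy_emb, clean_emb))

def contrastive_loss(clean_emb, noisy_emb, labels, temperature=0.1):
    # InfoNCE term: for each noisy embedding, the positive is its own
    # clean counterpart; negatives are other noisy embeddings in the
    # batch whose emotion labels differ (temperature is an assumption).
    clean = F.normalize(clean_emb, dim=-1)
    noisy = F.normalize(noisy_emb, dim=-1)
    pos = (noisy * clean).sum(dim=-1) / temperature      # (B,) noisy-to-clean similarity
    sim = noisy @ noisy.t() / temperature                # (B, B) noisy-to-noisy similarities
    diff = labels.unsqueeze(0) != labels.unsqueeze(1)    # True where emotion labels differ
    neg = sim.masked_fill(~diff, float('-inf'))          # mask same-label pairs (incl. self)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)   # (B, 1 + B)
    return -(pos - torch.logsumexp(logits, dim=1)).mean()

# Hypothetical training step combining both terms:
#   with torch.no_grad():
#       clean_emb = teacher(clean_speech)    # original fine-tuned SER model, frozen
#   noisy_emb = student(noisy_speech)        # copy being retrained on noisy speech
#   loss = retention_loss(clean_emb, noisy_emb) \
#        + contrastive_loss(clean_emb, noisy_emb, labels)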