On The Use Of Self-Supervised Pre-Trained Acoustic And Linguistic Features For Continuous Speech Emotion Recognition

Manon Macary, Marie Tahon, Yannick Estève, Anthony Rousseau

Length: 0:14:49

19 Jan 2021

Pre-training for feature extraction is an increasingly studied approach to obtaining better continuous representations of audio and text content. In the present work, we use wav2vec and CamemBERT as self-supervised pre-trained models to represent our data in order to perform continuous speech emotion recognition (SER) on AlloSat, a large French emotional database describing the satisfaction dimension, and on the state-of-the-art SEWA corpus, which focuses on the valence, arousal and liking dimensions. To the authors' knowledge, this paper presents the first study showing that the joint use of wav2vec and BERT-like pre-trained features is highly relevant to the continuous SER task, which is usually characterized by a small amount of labeled training data. Evaluated with the well-known concordance correlation coefficient (CCC), our experiments show that we can reach a CCC value of 0.825 on the AlloSat dataset, compared with 0.592 when using MFCC features in conjunction with word2vec word embeddings.
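
For readers unfamiliar with the metric, the concordance correlation coefficient between a predicted sequence and a reference annotation can be sketched as below. This is a minimal illustration of the standard CCC definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population statistics. It is not the authors' evaluation code, and the function name `concordance_cc` is a hypothetical choice made for this example.

```python
from statistics import fmean


def concordance_cc(predictions, labels):
    """Concordance correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; uses population (1/n) statistics.
    """
    if len(predictions) != len(labels) or len(predictions) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(predictions)
    mean_p = fmean(predictions)
    mean_l = fmean(labels)

    # Population variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_l = sum((l - mean_l) ** 2 for l in labels) / n
    cov = sum((p - mean_p) * (l - mean_l)
              for p, l in zip(predictions, labels)) / n

    denominator = var_p + var_l + (mean_p - mean_l) ** 2
    if denominator == 0:
        # Both sequences are constant and identical: CCC is undefined.
        raise ValueError("CCC is undefined for identical constant sequences")

    return 2 * cov / denominator
```

Unlike the Pearson correlation, the CCC penalizes a shift in mean or scale between prediction and reference: `concordance_cc([1, 2, 3], [2, 3, 4])` is 4/7 even though the two sequences are perfectly correlated.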

Tags:

sps conference

slt 2021
