MULTIPLE ACOUSTIC FEATURES SPEECH EMOTION RECOGNITION USING CROSS-ATTENTION TRANSFORMER

Yurun He (The University of Tokyo); Nobuaki Minematsu (The University of Tokyo); Daisuke Saito (The University of Tokyo)

06 Jun 2023

Speech emotion recognition (SER) is a challenging task whose performance relies heavily on suitable affect-salient representations. Recently, the transformer has proven effective at learning representations relevant to this task. However, a standard transformer can process only single-source input, and a transformer-based SER system typically uses only one kind of input feature, which may limit the knowledge available to the model. In this paper, we use the cross-attention transformer (CAT) to handle bi-source input. We propose a novel SER system that uses CAT to better fuse three types of acoustic features: raw waveform data, spectrogram, and MFCC. Experiments on the IEMOCAP benchmark dataset show that the proposed system achieves 73.80% weighted accuracy (WA) and 74.25% unweighted accuracy (UA), outperforming existing state-of-the-art approaches.
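To illustrate what bi-source input means for an attention layer, the following is a minimal single-head, unbatched sketch of generic scaled dot-product cross-attention, in which queries come from one feature stream and keys and values come from another. It is not the authors' CAT architecture: the function names, the dimensions, the random projection matrices, and the choice of spectrogram frames attending to MFCC frames are all assumptions made for illustration, and the abstract does not specify how the three features are paired.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head cross-attention between two feature sequences.

    query_feats:   (T_q, d_model) frames from the first source
    context_feats: (T_c, d_model) frames from the second source
    Returns the fused sequence (T_q, d_model) and the attention map (T_q, T_c).
    """
    q = query_feats @ w_q    # queries from source A
    k = context_feats @ w_k  # keys from source B
    v = context_feats @ w_v  # values from source B
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query frame attends over source B
    return weights @ v, weights


# Hypothetical inputs: random stand-ins for frame-level embeddings.
rng = np.random.default_rng(0)
d_model = 16
spec = rng.standard_normal((50, d_model))  # assumed spectrogram embeddings
mfcc = rng.standard_normal((40, d_model))  # assumed MFCC embeddings
w_q, w_k, w_v = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

fused, attn = cross_attention(spec, mfcc, w_q, w_k, w_v)
```

Here `fused` keeps the length of the query stream (50 frames) while each frame is a weighted mixture of the other stream's values, which is how two sources of different lengths can be combined without first aligning them.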

Tags:

Speech analysis and Language disorder Analysis

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
