MULTIPLE ACOUSTIC FEATURES SPEECH EMOTION RECOGNITION USING CROSS-ATTENTION TRANSFORMER
Yurun He (The University of Tokyo); Nobuaki Minematsu (The University of Tokyo); Daisuke Saito (The University of Tokyo)
SPS
Speech emotion recognition (SER) is a challenging task whose performance depends heavily on suitable affect-salient representations. Recently, transformers have shown an outstanding ability to learn representations relevant to this task. However, a standard transformer can process only a single input source, so transformer-based SER systems typically use one kind of input feature, which limits the knowledge available to the model. In this paper, we use the cross-attention transformer (CAT) to handle bi-source input and propose a novel SER system that fuses three types of acoustic features -- raw waveform data, spectrogram, and MFCC -- using CAT. Experiments on the IEMOCAP benchmark dataset show that the proposed system achieves 73.80% weighted accuracy (WA) and 74.25% unweighted accuracy (UA), outperforming existing state-of-the-art approaches.
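The abstract does not include code, so the following is only a rough illustrative sketch of the bi-source cross-attention idea it describes: queries come from one acoustic feature stream and keys/values from another, so each stream can attend to information in the other. All function names here are my own, and the single-head, projection-free form is a simplification; the actual CAT would use learned query/key/value projections and multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention between two feature streams.
    `queries`: frames from one feature (e.g. MFCC), shape (Tq, d).
    `keys_values`: frames from another (e.g. spectrogram), shape (Tk, d).
    Both are assumed already projected to a common dimension d.
    Returns a fused representation of shape (Tq, d)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (Tq, Tk) scaled similarities
    weights = softmax(scores, axis=-1)             # each query row sums to 1
    return weights @ keys_values                   # weighted mix of the other stream

# Hypothetical usage: fuse 5 MFCC frames with 7 spectrogram frames, both d=8.
rng = np.random.default_rng(0)
mfcc_frames = rng.standard_normal((5, 8))
spec_frames = rng.standard_normal((7, 8))
fused = cross_attention(mfcc_frames, spec_frames)  # shape (5, 8)
```

Unlike self-attention, the query sequence and the key/value sequence here may have different lengths; the output always follows the query stream's length, which is what lets one feature type be enriched with information from another.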