
MULTIPLE ACOUSTIC FEATURES SPEECH EMOTION RECOGNITION USING CROSS-ATTENTION TRANSFORMER

Yurun He (The University of Tokyo); Nobuaki Minematsu (The University of Tokyo); Daisuke Saito (The University of Tokyo)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
06 Jun 2023

Speech emotion recognition (SER) is a challenging task whose performance relies heavily on suitable affect-salient representations. Recently, the transformer has proven effective at learning representations relevant to this task. However, a standard transformer can process only single-source input, and a transformer-based SER system often uses only one kind of input feature, which may limit the knowledge available to the model. In this paper, we use the cross-attention transformer (CAT) to handle bi-source input. We propose a novel SER system that uses CAT to better fuse three types of acoustic features: raw waveform, spectrogram, and MFCC. Experiments on the IEMOCAP benchmark dataset show that the proposed system achieves 73.80% weighted accuracy (WA) and 74.25% unweighted accuracy (UA), outperforming existing state-of-the-art approaches.
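
To make the idea of bi-source input concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention, in which frames from one feature stream act as queries and frames from a second stream act as keys and values. It is not the architecture from the paper: the function names, the toy feature values, and the omission of learned projections, multiple heads, and residual layers are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(query_src, context_src):
    """Hypothetical helper: each frame of query_src attends over context_src.

    Assumes both sources already share the same feature dimension; a real
    system would apply learned query/key/value projections first.
    """
    d = len(query_src[0])
    scores = matmul(query_src, transpose(context_src))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    fused = matmul(weights, context_src)
    return fused, weights


# Toy values standing in for two acoustic feature streams (illustrative only).
mfcc_frames = [[1.0, 0.0], [0.0, 1.0]]                    # 2 frames, dim 2
spectrogram_frames = [[1.0, 1.0], [0.5, 0.0], [0.0, 2.0]]  # 3 frames, dim 2

fused, weights = cross_attention(mfcc_frames, spectrogram_frames)
print(len(fused), len(fused[0]))  # one fused vector per query frame
```

The output has one row per query frame, so the fused representation keeps the time resolution of the query stream while drawing its content from the other stream. Swapping the two arguments gives attention in the opposite direction.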
