FRONTEND ATTRIBUTES DISENTANGLEMENT FOR SPEECH EMOTION RECOGNITION

Yu-Xuan Xi, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:12:54
11 May 2022

Speech emotion recognition (SER) on limited-size datasets is a challenging task, since a spoken utterance contains various disturbing attributes besides emotion, including speaker, content, and language. However, due to the close relationship between speaker and emotion attributes, simply fine-tuning a linear model on utterance-level embeddings (i.e., i-vectors and x-vectors) extracted from pre-trained speaker recognition (SR) frontends is enough to obtain good SER performance. In this paper, we aim to perform frontend attributes disentanglement (AD) for the SER task using a pre-trained SR model. Specifically, an AD module consisting of attribute normalization (AN) and attribute reconstruction (AR) phases is proposed. AN filters out variation information using instance normalization (IN), and AR reconstructs emotion-relevant features from the residual space to ensure high emotion discrimination. For better disentanglement, a dual-space loss is designed to encourage the separability of the emotion-relevant and emotion-irrelevant spaces. To introduce long-range contextual information for emotion-related reconstruction, a time-frequency (TF) attention is further proposed. Unlike style disentanglement applied to extracted x-vectors, the proposed AD module can be applied to the frontend feature extractor. Experiments on the IEMOCAP benchmark demonstrate the effectiveness of the proposed method.
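The core of the AN phase is instance normalization, which removes per-utterance statistics (e.g., speaker and channel variation) from the frontend feature map, leaving a residual space from which the AR phase reconstructs emotion-relevant cues. A minimal NumPy sketch of this split is shown below; the function names and shapes are illustrative assumptions, and the AR phase (a learned reconstruction with TF attention in the paper) is reduced here to the raw residual for clarity.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # Normalize each feature channel over the time axis (per utterance),
    # filtering out utterance-level statistics such as speaker variation.
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)

def attribute_disentangle(x):
    # AN phase: IN removes variation information from the feature map.
    normalized = instance_norm(x)
    # Residual space: the statistics discarded by IN. In the paper the
    # AR phase reconstructs emotion-relevant features from this space
    # with a learned TF attention; here it is just the raw residual.
    residual = x - normalized
    return normalized, residual

rng = np.random.default_rng(0)
feats = rng.standard_normal((40, 200))  # (channels, frames) frontend feature map
norm, resid = attribute_disentangle(feats)
```

By construction the two outputs sum back to the original feature map, so no information is lost by the split; the discriminative work is then done by the losses that separate the two spaces.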
