SPEECH EMOTION RECOGNITION BASED ON LOW-LEVEL AUTO-EXTRACTED TIME-FREQUENCY FEATURES
Ke Liu (Northwest University); Jingzhao Hu (Northwest University); Jun Feng (Northwest University)
Deep-learning-based methods that aim to extract effective high-level features have steadily improved performance in speech emotion recognition. However, low-level features that contain important emotion-related information have not received much attention. In this paper, we propose a novel low-level feature extraction method based on a Time-Frequency Attention (TFA) module and a Time-Frequency Weighting (TFW) module. First, the TFA module is designed to learn salient regions in the detail-rich low-level feature maps produced by scale-specific convolutional layers. Then, the TFW module extracts discriminative features along the time and frequency dimensions, respectively. Finally, the speech emotion recognition task is completed by a subsequent multi-branch network. Experimental results on the IEMOCAP and RAVDESS datasets demonstrate the importance of low-level features and show that the proposed method outperforms other state-of-the-art approaches.
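The abstract gives only a high-level description of the TFA and TFW modules, so the PyTorch sketch below is one plausible reading of the pipeline, not the authors' implementation: it assumes TFA acts as a sigmoid attention mask over a low-level time-frequency feature map, and TFW pools that map along the time and frequency axes separately before learned per-axis weighting. All class names, layer choices, and tensor shapes here are illustrative assumptions.

```python
# Illustrative sketch only; the paper's actual module designs may differ.
import torch
import torch.nn as nn


class TimeFrequencyAttention(nn.Module):
    """Hypothetical TFA: highlights salient regions of a low-level
    time-frequency feature map via a learned sigmoid attention mask."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        mask = self.attn(x)      # (batch, 1, freq, time) attention weights
        return x * mask          # re-weight the feature map


class TimeFrequencyWeighting(nn.Module):
    """Hypothetical TFW: pools the map along time and frequency separately,
    yielding two discriminative branches for a multi-branch network."""

    def __init__(self, channels: int):
        super().__init__()
        self.freq_weight = nn.Conv1d(channels, channels, kernel_size=1)
        self.time_weight = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor):
        # Collapse time to get a frequency profile, and vice versa.
        freq_feat = self.freq_weight(x.mean(dim=3))  # (batch, channels, freq)
        time_feat = self.time_weight(x.mean(dim=2))  # (batch, channels, time)
        return freq_feat, time_feat


if __name__ == "__main__":
    # Toy input: a batch of 4 spectrogram feature maps with 16 channels,
    # 64 frequency bins, and 128 time frames (shapes are assumptions).
    x = torch.randn(4, 16, 64, 128)
    x = TimeFrequencyAttention(16)(x)
    f, t = TimeFrequencyWeighting(16)(x)
    print(f.shape, t.shape)  # torch.Size([4, 16, 64]) torch.Size([4, 16, 128])
```

In this reading, the attention step keeps the low-level detail intact while emphasizing emotion-relevant regions, and the separate frequency and time branches feed the subsequent multi-branch classifier described in the abstract.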