SPEECH EMOTION RECOGNITION VIA HETEROGENEOUS FEATURE LEARNING
Ke Liu (Northwest University); Dongya Wu (Northwest University); Dekui Wang (Northwest University); Jun Feng (Northwest University)
Speech emotion recognition (SER) based on multi-view learning has made progress in speaker-independent scenarios. However, existing SER methods often rely on an excessive number of feature views and overlook the importance of heterogeneous feature learning. In this paper, we propose a novel multi-level attention method that learns heterogeneous information from a hand-crafted feature (MFCC) and a feature extracted from a pre-trained model (W2V2). Specifically, we first design an Attention based Multi-scale Low-level Feature (A-MLF) extractor to extract scale-specific emotion-related regions from MFCC. Then, a Multi-Unit Attention (MUA) module simultaneously learns discriminative features along three different dimensions. Finally, a two-stage feature fusion strategy learns a joint representation space. We evaluate our method under two speaker-independent validation strategies and interpret its state-of-the-art (SOTA) performance by visualizing the feature distribution.
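To make the idea of attending over two heterogeneous feature views and then fusing them more concrete, the following is a minimal sketch in plain Python. It is not the A-MLF extractor, the MUA module, or the fusion strategy of the paper, whose details the abstract does not give. All names (`attention_pool`, `project`, `fuse`), the toy dimensions, the random parameters, and the choice of "project to a shared space, then weight the two views with a softmax" as the two stages are assumptions made purely for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_pool(frames, query):
    """Pool a (time x dim) sequence into one vector.

    Each frame is scored by a scaled dot product with `query`;
    the softmax of the scores gives the weights of the sum over time.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(frame, query) / scale for frame in frames])
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def project(vec, matrix):
    """Linear map; `matrix` holds one row per output dimension."""
    return [dot(row, vec) for row in matrix]


def fuse(mfcc_frames, w2v2_frames, params):
    """Hypothetical two-stage fusion of two feature views."""
    # Pool each view over time with its own (assumed) query vector.
    mfcc_vec, _ = attention_pool(mfcc_frames, params["q_mfcc"])
    w2v2_vec, _ = attention_pool(w2v2_frames, params["q_w2v2"])
    # Stage 1 (assumed): map both views into a shared space.
    a = project(mfcc_vec, params["W_mfcc"])
    b = project(w2v2_vec, params["W_w2v2"])
    # Stage 2 (assumed): softmax weights over the two views, then mix.
    view_weights = softmax([dot(a, params["g"]), dot(b, params["g"])])
    joint = [view_weights[0] * x + view_weights[1] * y for x, y in zip(a, b)]
    return joint, view_weights


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes only; real MFCC and W2V2 features are larger.
MFCC_DIM, W2V2_DIM, SHARED_DIM = 13, 32, 8
rng = random.Random(0)
params = {
    "q_mfcc": rand_matrix(1, MFCC_DIM, rng)[0],
    "q_w2v2": rand_matrix(1, W2V2_DIM, rng)[0],
    "W_mfcc": rand_matrix(SHARED_DIM, MFCC_DIM, rng),
    "W_w2v2": rand_matrix(SHARED_DIM, W2V2_DIM, rng),
    "g": rand_matrix(1, SHARED_DIM, rng)[0],
}
mfcc_frames = rand_matrix(20, MFCC_DIM, rng)   # 20 frames of stand-in MFCC
w2v2_frames = rand_matrix(10, W2V2_DIM, rng)   # 10 frames of stand-in W2V2

joint, view_weights = fuse(mfcc_frames, w2v2_frames, params)
```

Here `joint` is a single vector in the shared space and `view_weights` indicates how much each view contributed. In a trained model the queries, projections, and gating vector would be learned rather than random, and the two sequences would come from real MFCC and W2V2 front ends.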