Improving Convolutional Recurrent Neural Networks For Speech Emotion Recognition
Patrick Meyer, Ziyi Xu, Tim Fingscheidt
SPS
Length: 0:14:48
Deep learning has increased interest in speech emotion recognition (SER) and has produced diverse architectures and methods to improve performance. In recent years, applying SER to a (log-mel) spectrogram, and thus treating SER as an image recognition task, has proven to be a promising method. Following the trend towards combining a convolutional neural network (CNN) with a bidirectional long short-term memory (BLSTM) layer and some subsequent fully connected layers, we advance the performance of this topology with several contributions: we integrate a multi-kernel-width CNN, propose a BLSTM output summarization function, apply an enhanced feature representation, and introduce an effective training method. To provide insight into the proposed methods, we separately evaluate the impact of each modification in an ablation study. Based on our modifications, we obtain top results for this type of topology on IEMOCAP, with an unweighted average recall of 64.5% on average.
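For readers unfamiliar with the reported metric: unweighted average recall (UAR) is the mean of the per-class recalls, so each emotion class counts equally regardless of how many utterances it contains. The following is a minimal sketch of that computation in plain Python. The function name and the toy labels are hypothetical illustrations, not the authors' evaluation code or IEMOCAP data.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no samples given")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly classified samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (assumption, not IEMOCAP data):
# 4 "neutral", 2 "angry", 2 "sad" utterances.
y_true = ["neutral"] * 4 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 4 + ["angry", "sad"] + ["neutral", "sad"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

In this toy case the per-class recalls are 1.0, 0.5, and 0.5, so the UAR is lower than the plain accuracy, which is dominated by the larger "neutral" class. This is why UAR is commonly preferred on class-imbalanced emotion corpora.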