Emotion Recognition In Public Speaking Scenarios Utilising An Lstm-Rnn Approach With Attention
Alice Baird, Shahin Amiriparian, Manuel Milling, Björn Schuller
Speaking in public can be a cause of fear for many people. Research suggests that there are physical markers, such as an increased heart rate and vocal tremolo, which indicate an individual's state of wellbeing during a public speech. In this study, we explore the advantages of speech-based features for continuous recognition of the emotional dimensions of arousal and valence during a public speaking scenario. Furthermore, we explore biological-signal fusion and perform a cross-language (German and English) analysis, training language-independent models and testing them on speech from various native and non-native speaker groupings. For the emotion recognition task itself, we utilise a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) architecture with a self-attention layer. When utilising audio-only features and testing on non-native speakers of German speaking German, we achieve concordance correlation coefficients (CCC) of at best 0.640 for arousal and 0.491 for valence, demonstrating a strong effect for this task from non-native speakers, as well as the promise of deep learning for continuous emotion recognition in the context of public speaking.
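The abstract names the core modelling components: an LSTM-RNN with a self-attention layer regressing frame-level arousal and valence, evaluated with the concordance correlation coefficient (CCC). Below is a minimal sketch of such a setup, assuming PyTorch, illustrative layer sizes, and an 88-dimensional input (the size of the eGeMAPS acoustic feature set, a common choice for this task, though the paper's exact features are not stated here); it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of an LSTM-RNN with a simple
# self-attention layer for continuous arousal/valence regression, plus
# the concordance correlation coefficient (CCC) used for evaluation.
# All layer sizes and the attention formulation are illustrative assumptions.
import torch
import torch.nn as nn


class AttentiveLSTMRegressor(nn.Module):
    """LSTM over frame-level acoustic features, followed by additive
    self-attention and a per-frame regression head for (arousal, valence)."""

    def __init__(self, n_features: int = 88, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.attn_score = nn.Linear(hidden, 1)  # scores each time step
        self.head = nn.Linear(2 * hidden, 2)    # predicts (arousal, valence)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features), e.g. acoustic features per frame
        h, _ = self.lstm(x)                                  # (batch, time, hidden)
        weights = torch.softmax(self.attn_score(h), dim=1)   # attention over time
        context = (weights * h).sum(dim=1, keepdim=True)     # (batch, 1, hidden)
        # concatenate the attended utterance context to every frame
        h = torch.cat([h, context.expand_as(h)], dim=-1)
        return self.head(h)                                  # (batch, time, 2)


def ccc(pred: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
    """Concordance correlation coefficient between two 1-D sequences."""
    p_mean, g_mean = pred.mean(), gold.mean()
    covar = ((pred - p_mean) * (gold - g_mean)).mean()
    return 2 * covar / (pred.var(unbiased=False) + gold.var(unbiased=False)
                        + (p_mean - g_mean) ** 2)


if __name__ == "__main__":
    model = AttentiveLSTMRegressor()
    feats = torch.randn(4, 300, 88)              # 4 clips, 300 frames each
    out = model(feats)
    print(out.shape)                             # torch.Size([4, 300, 2])
    print(ccc(out[0, :, 0], torch.randn(300)))   # CCC vs. random gold labels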