Robust multi-modal speech emotion recognition with ASR error adaptation
Binghuai Lin (MIG, Tencent Science and Technology Ltd.); Liyuan Wang (Tencent Technology Co., Ltd.)
Multi-modal speech emotion recognition (SER) improves performance over single-modal systems. However, errors from automatic speech recognition (ASR) in the text modality may degrade SER performance. This paper proposes an SER method that is robust to ASR errors. We exploit complementary semantic information from the audio to reduce the impact of ASR errors, using an attention mechanism to compute weighted acoustic representations. This information is fused with the text representations of the ASR hypotheses via an adaptive weight, which is determined by an auxiliary ASR error detection task. Finally, the fused text representations are concatenated with the acoustic representations to perform SER. Results on the public Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset show that, even when using ASR hypotheses with a high word error rate (WER), the proposed method remains robust, with only slight performance drops compared with traditional multi-modal models. We further demonstrate its robustness using ASR hypotheses with different WERs.
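To make the described fusion concrete, the sketch below shows one plausible PyTorch realization of the pipeline from the abstract. The module names, feature dimensions, mean pooling, and the sigmoid gate used as the adaptive weight are illustrative assumptions, not the authors' architecture.

```python
# A minimal sketch of attention-based fusion with an adaptive weight from an
# auxiliary ASR error detection head. Dimensions and pooling are assumptions.
import torch
import torch.nn as nn


class AdaptiveFusionSER(nn.Module):
    """Fuse ASR-hypothesis text features with attention-weighted acoustic
    features, gated by a per-token ASR error probability."""

    def __init__(self, text_dim=768, audio_dim=512, num_emotions=4):
        super().__init__()
        # Project acoustic frames into the text space so the two can be mixed.
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        # Attention: each text token queries the acoustic frames for
        # complementary semantic information (weighted acoustic representations).
        self.attn = nn.MultiheadAttention(text_dim, num_heads=8, batch_first=True)
        # Auxiliary head: predicts, per token, whether the ASR hypothesis is
        # erroneous; its output doubles as the adaptive fusion weight.
        self.error_head = nn.Linear(text_dim, 1)
        # Classifier over concatenated fused-text and acoustic utterance vectors.
        self.classifier = nn.Linear(text_dim + text_dim, num_emotions)

    def forward(self, text_repr, audio_repr):
        # text_repr:  (B, T_text, text_dim)   ASR-hypothesis token features
        # audio_repr: (B, T_audio, audio_dim) acoustic frame features
        audio_h = self.audio_proj(audio_repr)
        # Weighted acoustic representations aligned to each text token.
        audio_ctx, _ = self.attn(text_repr, audio_h, audio_h)
        # Adaptive weight in (0, 1): high when a token is likely an ASR error,
        # so the fused representation leans on the acoustic context instead.
        err_prob = torch.sigmoid(self.error_head(text_repr))  # (B, T_text, 1)
        fused_text = (1 - err_prob) * text_repr + err_prob * audio_ctx
        # Pool to utterance level; concatenate fused text with acoustic features.
        utt = torch.cat([fused_text.mean(dim=1), audio_h.mean(dim=1)], dim=-1)
        return self.classifier(utt), err_prob.squeeze(-1)


# Usage with random features standing in for text/acoustic encoder outputs.
model = AdaptiveFusionSER()
logits, err = model(torch.randn(2, 20, 768), torch.randn(2, 100, 512))
print(logits.shape, err.shape)  # torch.Size([2, 4]) torch.Size([2, 20])
```

In this reading, the auxiliary error-detection loss would be trained jointly with the emotion loss, so that tokens flagged as likely ASR errors contribute less text information and more audio-derived context to the fused representation.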