SELECTIVE MULTI-TASK LEARNING FOR SPEECH EMOTION RECOGNITION USING CORPORA OF DIFFERENT STYLES
Heran Zhang, Masato Mimura, Tatsuya Kawahara, Kenkichi Ishizuka
SPS
While speech emotion recognition (SER) has been actively studied, the amount and variety of training data are limited compared with speech recognition and speaker recognition tasks. It is therefore promising to combine multiple corpora to train a generalized SER model. However, the manner of emotion expression differs depending on the recording settings, task domains, and languages. In particular, there is a mismatch between acted and spontaneous datasets, since the former contains much richer and more explicit emotion expressions than the latter. In this paper, we investigate effective combination methods based on multi-task learning (MTL) that take the style attribute into account. We also hypothesize that the neutral expression, which has the largest number of samples, is not affected by style, and thus propose a selective MTL method that applies MTL to all emotion categories except the neutral category. Experimental evaluations using the IEMOCAP database and a call center dataset confirm the effectiveness of combining the two corpora, of MTL, and of the proposed selective MTL.
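One way to picture the selective MTL idea described above is with style-specific classifier heads that share a single output unit for the neutral category, while the remaining emotion categories get per-style units. The following is a minimal NumPy sketch under that assumption; the emotion set, style labels, shapes, and function names are all illustrative, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of selective MTL: the neutral category is treated as
# style-independent (one shared output unit), while the other emotion
# categories use style-specific output units. All names and dimensions
# here are assumptions for illustration only.

EMOTIONS = ["neutral", "happy", "sad", "angry"]
STYLES = ["acted", "spontaneous"]

rng = np.random.default_rng(0)
feat_dim = 8

# One weight matrix per style for the non-neutral emotions...
W_style = {s: rng.normal(size=(feat_dim, len(EMOTIONS) - 1)) for s in STYLES}
# ...and a single shared weight vector for the neutral category.
w_neutral = rng.normal(size=(feat_dim, 1))

def logits(x, style):
    """Neutral logit is shared across styles; others come from the style head."""
    return np.concatenate([x @ w_neutral, x @ W_style[style]], axis=-1)

def softmax_ce(z, label_idx):
    """Cross-entropy of a softmax over logits z against an integer label."""
    z = z - z.max()  # numerical stability
    logp = z - np.log(np.exp(z).sum())
    return -logp[label_idx]

x = rng.normal(size=feat_dim)  # stand-in for an utterance embedding
# By construction, the neutral logit is identical under both style heads.
assert np.isclose(logits(x, "acted")[0], logits(x, "spontaneous")[0])
loss = softmax_ce(logits(x, "acted"), EMOTIONS.index("happy"))
```

In training, each utterance would be routed to the head matching its corpus style, so gradients for non-neutral emotions stay style-specific while the neutral unit is updated by samples from both corpora.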