-
SPS
IEEE Members: $11.00
Non-members: $15.00Length: 0:15:25
Domain generalization is a major challenge for cross-corpus speech emotion recognition. The recognition performance built on "seen" source corpora is inevitably degraded when the systems are tested against "unseen" target corpora that have different speakers, channels, and languages. We present a novel framework based on a triplet network to learn more generalized features of emotional speech that are invariant across multiple corpora. To reduce the intrinsic discrepancies between source and target corpora, an explicit feature transformation based on the triplet network is implemented as a preprocessing step. Extensive comparison experiments are carried out on three emotional speech corpora; two English corpora, and one Japanese corpus. Remarkable improvements of up-to 35.61% are achieved for all cross-corpus speech emotion recognition, and we show that the proposed framework using the triplet network is effective for obtaining more generalized features across multiple emotional speech corpora.