PHONETIC ANCHOR-BASED TRANSFER LEARNING TO FACILITATE UNSUPERVISED CROSS-LINGUAL SPEECH EMOTION RECOGNITION
Shreya G. Upadhyay (National Tsing Hua University); Luz Martinez-Lucas (Department of Electrical and Computer Engineering, The University of Texas at Dallas); Bo-Hao Su (Department of Electrical Engineering, National Tsing Hua University); Wei-Cheng Lin (The University of Texas at Dallas); Woan-Shiuan Chien (Department of Electrical Engineering, National Tsing Hua University); Ya-Tse Wu (Department of Electrical Engineering, National Tsing Hua University); William F. Katz (The University of Texas at Dallas); Carlos Busso (The University of Texas at Dallas); Chi-Chun Lee (National Tsing Hua University)
Modeling cross-lingual speech emotion recognition (SER) has become more prevalent because of its diverse applications. Existing studies have mostly focused on technical approaches that adapt features, domains, or labels across languages, without considering in detail the similarities between the languages. This study focuses on domain adaptation in cross-lingual scenarios using phonetic constraints. The work is twofold. First, we analyze emotion-specific phonetic commonality across languages by identifying common vowels that are useful for SER modeling. Second, we leverage these common vowels as an anchoring mechanism to facilitate cross-lingual SER. We consider American English and Taiwanese Mandarin as a case study to demonstrate the potential of our approach. This work uses two in-the-wild natural emotional speech corpora: MSP-Podcast (American English) and BIIC-Podcast (Taiwanese Mandarin). The proposed unsupervised cross-lingual SER model using these phonetic anchors outperforms the baselines, achieving an unweighted average recall (UAR) of 58.64%.
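To make the anchoring idea concrete, below is a minimal PyTorch sketch of one way phonetic anchors could enter an unsupervised adaptation objective: labeled source-language (e.g., MSP-Podcast) vowel segments drive a standard emotion-classification loss, while an auxiliary term pulls the embeddings of common-vowel segments from the unlabeled target language (e.g., BIIC-Podcast) toward those of the source. This is not the authors' implementation; the SERModel architecture, the anchor_alignment_loss centroid-matching term, and the lambda_anchor weight are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class SERModel(nn.Module):
    """Toy emotion classifier over fixed-size acoustic features (hypothetical architecture)."""
    def __init__(self, feat_dim=128, n_emotions=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())
        self.classifier = nn.Linear(64, n_emotions)

    def forward(self, x):
        z = self.encoder(x)          # shared embedding space for both languages
        return self.classifier(z), z

def anchor_alignment_loss(z_src, z_tgt):
    # Assumed alignment term: pull the centroids of common-vowel embeddings
    # from the two languages toward each other (squared L2 between means).
    return (z_src.mean(dim=0) - z_tgt.mean(dim=0)).pow(2).sum()

# Dummy batches standing in for features of common-vowel segments.
src_vowel_feats = torch.randn(32, 128)    # source language, e.g., MSP-Podcast
src_labels = torch.randint(0, 4, (32,))   # source emotion labels
tgt_vowel_feats = torch.randn(32, 128)    # target language, unlabeled, e.g., BIIC-Podcast

model = SERModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
lambda_anchor = 0.1  # trade-off weight for the anchor term (assumed value)

# One training step: supervised loss on the source plus unsupervised
# phonetic-anchor alignment between source and target embeddings.
logits_src, z_src = model(src_vowel_feats)
_, z_tgt = model(tgt_vowel_feats)
loss = ce(logits_src, src_labels) + lambda_anchor * anchor_alignment_loss(z_src, z_tgt)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The key design choice this sketch illustrates is that only segments of vowels shared by both languages feed the alignment term, so the adaptation pressure is applied where the acoustic realizations are most comparable across languages.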