Cross-speaker Emotion Transfer by Manipulating Speech Style Latents
Suhee Jo (Neosapience, Inc.); Younggun Lee (Neosapience); Yookyung Shin (Neosapience, Inc.); Yeongtae Hwang (Neosapience, Inc.); Taesu Kim (Neosapience, Inc.)
SPS
In recent years, emotional text-to-speech has made considerable progress. However, it requires a large amount of labeled data, which is not easily accessible. Even when an emotional speech dataset is available, controlling emotion intensity remains difficult. In this work, we propose a novel method for cross-speaker emotion transfer and manipulation using vector arithmetic in a latent style space. By leveraging only a few labeled samples, we generate emotional speech from reading-style speech without losing speaker identity. Furthermore, emotion strength is readily controllable with a single scalar value, providing an intuitive way for users to manipulate speech. Experimental results show that the proposed method achieves superior expressiveness, naturalness, and controllability while preserving speaker identity.
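The core idea of vector arithmetic in a latent style space can be illustrated with a minimal sketch. The function names and the 8-dimensional toy latents below are illustrative assumptions, not the paper's implementation; in the actual method, style latents would come from a trained style encoder, and the emotion direction would be estimated from a few labeled samples:

```python
import numpy as np

def emotion_direction(emotional_embeds, neutral_embeds):
    # Estimate the emotion axis as the difference between the mean
    # emotional and mean neutral style latents (few labeled samples suffice).
    return np.mean(emotional_embeds, axis=0) - np.mean(neutral_embeds, axis=0)

def transfer_emotion(style_embed, direction, strength=1.0):
    # Shift a speaker's reading-style latent along the emotion axis;
    # the scalar `strength` controls emotion intensity.
    return style_embed + strength * direction

# Toy example with 8-dimensional style latents (hypothetical dimensionality).
rng = np.random.default_rng(0)
neutral = rng.normal(size=(4, 8))          # reading-style latents
emotional = neutral + 1.5                  # pretend emotional latents are offset
d = emotion_direction(emotional, neutral)
styled = transfer_emotion(neutral[0], d, strength=0.5)
```

Because the speaker's own latent is only shifted, not replaced, speaker identity is retained while emotion and its intensity are controlled by `direction` and `strength`.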