Semi-Supervised Learning Based On Hierarchical Generative Models For End-To-End Speech Synthesis
Takato Fujimoto, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 12:24
This paper proposes a general framework of semi-supervised learning based on hierarchical generative models and adapts it to a Japanese end-to-end text-to-speech (TTS) system. In English TTS, several end-to-end systems have recently achieved sound quality close to that of natural human speech. However, in non-alphabetic languages such as Japanese, it is difficult to realize true text-input end-to-end TTS due to character diversity and pitch accents. To address this problem, we propose end-to-end TTS based on semi-supervised learning that makes the most of existing data consisting of any combination of text, phoneme, and waveform as training data. To demonstrate the effectiveness of the proposed system, listening tests were conducted for pronunciation and naturalness. Our results show that the proposed system improves both pronunciation and naturalness.