IMPROVING PHONETIC REALIZATIONS IN TTS BY USING PHONEME-ALIGNED GRAPHEMES
Manish Sharma, Yizhi Hong, Emily Kaplan, Siamak Tazari, Rob Clark
Most text-to-speech acoustic models, such as WaveNet, Tacotron, and ClariNet, use either a phoneme sequence or a letter sequence as the fundamental unit of speech. Although the letter (or grapheme) sequence closely matches the actual runtime input of the TTS system, it often fails to represent fine-grained phonetic variation. A purely phonemic input tends to perform better in practice, though it is heavily dependent on a meticulously crafted phonology and lexicon. This reliance poses quality and consistency issues, forcing a trade-off between quality and scalability. To overcome this, we propose using a mix of the two inputs, namely providing phoneme-aligned graphemes to the model. In this paper, we show that this approach helps the model learn to disambiguate some of the more subtle phonemic variations (such as the realization of reduced vowels), and that this effect improves fidelity to the accent of the original voice talent. For evaluation, we present a way of generating an unbiased targeted test using phoneme spectral diffs, and using that test, show improvement over the baseline approach for multiple voice technologies and multiple locales.
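To illustrate the core idea of a phoneme-aligned grapheme input, the sketch below pairs each phoneme with the grapheme span it aligns to. This is not the paper's implementation: the alignment would normally come from a grapheme-to-phoneme aligner, and here it is hard-coded for a single word; the `phoneme_aligned_graphemes` helper and the `phoneme|graphemes` token format are hypothetical choices for illustration only.

```python
def phoneme_aligned_graphemes(graphemes, phonemes, alignment):
    """Combine each phoneme with its aligned grapheme span.

    graphemes: the word as a string of letters.
    phonemes:  list of phoneme symbols for the word.
    alignment: list of (phoneme_index, grapheme_start, grapheme_end)
               tuples, assumed to come from an external G2P aligner.
    Returns tokens of the hypothetical form "phoneme|graphemes".
    """
    tokens = []
    for p_idx, g_start, g_end in alignment:
        tokens.append(f"{phonemes[p_idx]}|{graphemes[g_start:g_end]}")
    return tokens


# "cat" -> /k ae t/ with a one-to-one letter alignment
tokens = phoneme_aligned_graphemes(
    "cat", ["k", "ae", "t"], [(0, 0, 1), (1, 1, 2), (2, 2, 3)]
)
print(tokens)  # ['k|c', 'ae|a', 't|t']
```

Such combined tokens keep the phonemic signal while still exposing the spelling context to the model, which is what lets it disambiguate cases where the same phoneme label hides different phonetic realizations.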