IMPROVING PHONETIC REALIZATIONS IN TTS BY USING PHONEME-ALIGNED GRAPHEMES
Manish Sharma, Yizhi Hong, Emily Kaplan, Siamak Tazari, Rob Clark
Most text-to-speech acoustic models, such as WaveNet, Tacotron, and ClariNet, use either a phoneme sequence or a letter sequence as the fundamental unit of speech. Although the letter (or grapheme) sequence closely matches the actual runtime input of the TTS system, it often fails to represent fine-grained phonetic variation. A purely phonemic input tends to perform better in practice, though it is heavily dependent on a meticulously crafted phonology and lexicon. This reliance poses quality and consistency issues, forcing a trade-off between quality and scalability. To overcome this, we propose using a mix of the two inputs, namely providing phoneme-aligned graphemes to the model. In this paper, we show that this approach helps the model learn to disambiguate some of the more subtle phonemic variations (such as the realization of reduced vowels), and that this effect improves fidelity to the accent of the original voice talent. For evaluation, we present a way of generating an unbiased targeted test using phoneme spectral diffs, and using that test, show improvement over the baseline approach for multiple voice technologies and multiple locales.
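To illustrate the core idea of a phoneme-aligned grapheme input, the sketch below pairs each phoneme with the grapheme span it aligns to. This is not the paper's implementation: the alignment would normally come from a grapheme-to-phoneme aligner, and here it is hard-coded for a single word; the `phoneme_aligned_graphemes` helper and the `phoneme|graphemes` token format are hypothetical choices for illustration only.

```python
def phoneme_aligned_graphemes(graphemes, phonemes, alignment):
    """Combine each phoneme with its aligned grapheme span.

    graphemes: the word as a string of letters.
    phonemes:  list of phoneme symbols for the word.
    alignment: list of (phoneme_index, grapheme_start, grapheme_end)
               tuples, assumed to come from an external G2P aligner.
    Returns tokens of the hypothetical form "phoneme|graphemes".
    """
    tokens = []
    for p_idx, g_start, g_end in alignment:
        tokens.append(f"{phonemes[p_idx]}|{graphemes[g_start:g_end]}")
    return tokens


# "cat" -> /k ae t/ with a one-to-one letter alignment
tokens = phoneme_aligned_graphemes(
    "cat", ["k", "ae", "t"], [(0, 0, 1), (1, 1, 2), (2, 2, 3)]
)
print(tokens)  # ['k|c', 'ae|a', 't|t']
```

Such combined tokens keep the phonemic signal while still exposing the spelling context to the model, which is what lets it disambiguate cases where the same phoneme label hides different phonetic realizations.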