Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions

Yinghao A Li (Columbia University); Cong Han (Columbia University); Xilin Jiang (Columbia University); Nima Mesgarani (Columbia University)

06 Jun 2023

Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sup-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.
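As a rough illustration of the pretext task described in the abstract, the sketch below builds one training example with two kinds of targets: the original phonemes at masked positions, and the corresponding grapheme token for each phoneme. The function name `make_pretext_example`, the per-word masking, the 15% masking rate, and the choice to attach a grapheme target to every position are assumptions made for this sketch and are not taken from the abstract.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (hypothetical)


def make_pretext_example(words, mask_prob=0.15, seed=0):
    """Build inputs and targets for a joint masked-phoneme / grapheme task.

    words: list of (grapheme_token, [phonemes]) pairs, one pair per word.
    Returns (inputs, phoneme_targets, grapheme_targets), all phoneme-aligned.
    """
    rng = random.Random(seed)
    inputs, phoneme_targets, grapheme_targets = [], [], []
    for grapheme, phonemes in words:
        # Assumption: all phonemes of a word are masked together.
        masked = rng.random() < mask_prob
        for p in phonemes:
            inputs.append(MASK if masked else p)
            # Masked-phoneme target exists only at masked positions;
            # None marks positions a loss function would ignore.
            phoneme_targets.append(p if masked else None)
            # Assumption: a grapheme target is attached to every phoneme.
            grapheme_targets.append(grapheme)
    return inputs, phoneme_targets, grapheme_targets


if __name__ == "__main__":
    example = [("hello", ["HH", "AH", "L", "OW"]), ("world", ["W", "ER", "L", "D"])]
    for row in zip(*make_pretext_example(example, mask_prob=0.5)):
        print(row)
```

In a setup like this, a phoneme-level encoder would typically be trained with the sum of two classification losses, one per target sequence, so that the phoneme representations also carry information about the underlying text.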

Tags:

language modeling

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
