UNSUPERVISED PRE-TRAINING FOR DATA-EFFICIENT TEXT-TO-SPEECH ON LOW RESOURCE LANGUAGES
Seongyeon Park (Seoul National University); Myungseo Song (CNAI); Bohyung Kim (CNAI); Tae-Hyun Oh (POSTECH)
Neural text-to-speech (TTS) models can synthesize natural human
speech when trained on large amounts of transcribed speech.
However, collecting such large-scale transcribed data is expensive.
This paper proposes an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
With our pre-training, we can remarkably reduce the amount of paired transcribed data required to train the model for the target downstream TTS task.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones, which may allow the model to learn the proper temporal alignment between input and output sequences, as sketched below.
In addition, we propose a data augmentation method that further improves the data efficiency in fine-tuning.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios, where it outperforms competing methods.
The code and audio samples are available at: https://github.com/cnaigithub/SpeechDewarping
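The following is a minimal, illustrative sketch of the warped-to-dewarped reconstruction idea described above, not the paper's actual implementation. It assumes the warping is realized as segment-wise random time-stretching of a mel-spectrogram via linear interpolation; the segment count, stretch range, and loss are hypothetical choices for illustration.

```python
# Hypothetical sketch of the speech de-warping pre-training target construction.
# Assumption: warping = segment-wise random time-stretching of a mel-spectrogram;
# the paper's exact warping procedure may differ.
import torch
import torch.nn.functional as F


def random_time_warp(mel: torch.Tensor, num_segments: int = 4,
                     max_stretch: float = 2.0) -> torch.Tensor:
    """Warp a mel-spectrogram of shape (n_mels, T) by resampling random-length segments."""
    n_mels, total_frames = mel.shape
    # Split the time axis into roughly equal segments.
    bounds = torch.linspace(0, total_frames, num_segments + 1).long()
    warped_segments = []
    for i in range(num_segments):
        start, end = int(bounds[i]), int(bounds[i + 1])
        seg = mel[:, start:end]                      # (n_mels, seg_len)
        if seg.shape[1] == 0:
            continue
        # Draw a random stretch factor in [1/max_stretch, max_stretch].
        factor = torch.empty(1).uniform_(1.0 / max_stretch, max_stretch).item()
        new_len = max(1, int(round(seg.shape[1] * factor)))
        # Linear interpolation along the time axis (expects a 3-D tensor).
        seg = F.interpolate(seg.unsqueeze(0), size=new_len,
                            mode="linear", align_corners=False).squeeze(0)
        warped_segments.append(seg)
    return torch.cat(warped_segments, dim=1)


# Usage: the warped mel serves as the model input and the original (de-warped) mel
# as the reconstruction target, so no transcription is needed during pre-training.
mel = torch.randn(80, 200)          # dummy mel-spectrogram: 80 bins, 200 frames
warped = random_time_warp(mel)
# loss = F.l1_loss(model(warped), mel)   # placeholder pre-training objective
```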