Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts
Detai Xin (The University of Tokyo); Sharath Adavanne (Rakuten Inc.); Federico Ang (Rakuten Inc.); Ashish Kulkarni (Rakuten); Shinnosuke Takamichi (The University of Tokyo); Hiroshi Saruwatari (The University of Tokyo)
SPS
We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information — the acoustic context of the preceding utterance and the bilateral (preceding and following) textual context — to improve the prosody of synthetic speech.
Previous work uses either unilateral or single-modality context, neither of which fully represents the available context information.
The proposed method uses an acoustic context encoder and a textual context encoder to aggregate context information and feeds it to the TTS model, which enables the model to predict context-dependent prosody.
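The abstract does not specify the encoder architectures, but the aggregation step it describes can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the mean-pooling encoders, the feature dimensions (80-dim mel frames, 256-dim sentence embeddings), and all function names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_acoustic_context(mel_frames):
    # Stand-in acoustic context encoder: mean-pool the mel frames
    # of the preceding utterance into a single vector.
    return mel_frames.mean(axis=0)

def encode_textual_context(prev_emb, next_emb):
    # Stand-in textual context encoder: combine embeddings of the
    # preceding and following sentences (bilateral textual context).
    return np.concatenate([prev_emb, next_emb])

def context_vector(mel_frames, prev_emb, next_emb):
    # Aggregate both modalities into one conditioning vector that a
    # TTS model could consume to predict context-dependent prosody.
    return np.concatenate([
        encode_acoustic_context(mel_frames),
        encode_textual_context(prev_emb, next_emb),
    ])

# Hypothetical shapes: 120 mel frames of dim 80, 256-dim sentence embeddings.
mel = rng.standard_normal((120, 80))   # preceding utterance
prev_text = rng.standard_normal(256)   # preceding sentence embedding
next_text = rng.standard_normal(256)   # following sentence embedding

ctx = context_vector(mel, prev_text, next_text)
print(ctx.shape)  # (592,)
```

In a real system the pooled vectors would come from trained neural encoders and the conditioning vector would be injected into the acoustic model; the sketch only shows the cross-modal aggregation pattern the abstract describes.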
We conducted comprehensive objective and subjective evaluations on a multi-speaker Japanese audiobook dataset.
Experimental results demonstrate that the proposed method significantly outperforms two previous works.
Additionally, we present insights about different choices of context (modality, laterality, and length) for audiobook TTS that have not previously been discussed in the literature.