PERIOD VITS: VARIATIONAL INFERENCE WITH EXPLICIT PITCH MODELING FOR END-TO-END EMOTIONAL SPEECH SYNTHESIS

Yuma Shirahata (LINE Corp.); Ryuichi Yamamoto (LINE Corp.); Eunwoo Song (Naver Corporation); Ryo Terashima (LINE Corp.); Jae-Min Kim (NAVER Cloud Corp.); Kentaro Tachibana (LINE Corp.)

DOI

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

06 Jun 2023

Several fully end-to-end text-to-speech (TTS) models have been proposed that have shown better performance compared to cascade models (i.e., training acoustic and vocoder models separately). However, they often generate unstable pitch contour with audible artifacts when the dataset contains emotional attributes, i.e., large diversity of pronunciation and prosody. To address this problem, we propose Period VITS, a novel end-to-end TTS model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text. From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch. Finally, the entire model is jointly optimized in an end-to-end manner with variational inference and adversarial objectives. As a result, the decoder becomes capable of generating more stable, expressive, and natural output waveforms. The experimental results showed that the proposed model significantly outperforms baseline models in terms of naturalness, with improved pitch stability in the generated samples.

Tags:

Speech and singing voice synthesis/convertion/coding

PERIOD VITS: VARIATIONAL INFERENCE WITH EXPLICIT PITCH MODELING FOR END-TO-END EMOTIONAL SPEECH SYNTHESIS

Yuma Shirahata (LINE Corp.); Ryuichi Yamamoto (LINE Corp.); Eunwoo Song (Naver Corporation); Ryo Terashima (LINE Corp.); Jae-Min Kim (NAVER Cloud Corp.); Kentaro Tachibana (LINE Corp.)

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

More Like This

IMPROVED APPLIANCE TRANSIENT FEATURE EXTRACTION VIA TEMPLATE MATCHING

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Join the IEEE Signal Processing Society