HIGH-ACOUSTIC FIDELITY TEXT TO SPEECH SYNTHESIS WITH FINE-GRAINED CONTROL OF SPEECH ATTRIBUTES
Rafael Valle (NVIDIA); João Felipe Santos (NVIDIA); Kevin Shih (NVIDIA); Rohan Badlani (NVIDIA); Bryan Catanzaro (NVIDIA)
Recently developed neural TTS models have focused on robustness and finer control over acoustic attributes such as phoneme duration, energy, and f0, giving users some degree of control over the prosody of the generated speech. We propose a model with fine-grained attribute control that, as our experiments show, also achieves better acoustic fidelity than previously proposed models: the attributes of the output that we want to control deviate less from their control signals. Unlike other models, our proposed model does not require fine-tuning the vocoder on its outputs, indicating that it generates higher-quality mel-spectrograms that are closer to the ground-truth distribution than those of other models.
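Acoustic fidelity in the sense used above can be quantified by comparing each control signal against the corresponding attribute measured from the synthesized speech. The sketch below is a minimal, hypothetical illustration of such a metric (mean absolute deviation per frame); the function name and the example f0 values are assumptions for illustration, not part of the paper's evaluation protocol.

```python
import numpy as np

def attribute_fidelity_error(control, realized):
    """Mean absolute deviation between a control signal (e.g. per-frame
    target f0 in Hz) and the same attribute extracted from the synthesized
    audio. Lower values mean the model tracks the control signal more
    closely, i.e. higher acoustic fidelity for that attribute."""
    control = np.asarray(control, dtype=float)
    realized = np.asarray(realized, dtype=float)
    return float(np.mean(np.abs(control - realized)))

# Hypothetical per-frame f0 targets and values measured from the output audio.
target_f0 = [220.0, 225.0, 230.0, 228.0]
output_f0 = [221.0, 224.0, 231.0, 227.0]
print(attribute_fidelity_error(target_f0, output_f0))  # 1.0
```

The same comparison could be applied to phoneme durations or frame energies by substituting the appropriate extracted attribute.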