VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation
Rohan Badlani (NVIDIA); Akshit Arora (NVIDIA); Subhankar Ghosh (NVIDIA); Rafael Valle (NVIDIA); Kevin Shih (NVIDIA); João Felipe Santos (NVIDIA); Boris Ginsburg (NVIDIA); Bryan Catanzaro (NVIDIA)
We introduce VANI, a very lightweight, multilingual, accent-controllable speech synthesis system. Our model builds upon the disentanglement strategies proposed in RADMMM and supports explicit control of accent, language, speaker, and fine-grained F0 and energy features for speech synthesis. We use the Indic-languages dataset released for LIMMITS 2023, part of the ICASSP Signal Processing Grand Challenge, to synthesize speech in three different languages. Our model supports transferring a speaker to a different language while retaining their voice and adopting the native accent of the target language. We use the large-parameter RADMMM model for Track 1 and the lightweight VANI model for Tracks 2 and 3 of the competition.
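The abstract describes explicit conditioning on speaker, accent, language, and fine-grained F0/energy. The sketch below (not the authors' code; all module and parameter names are hypothetical) illustrates one plausible way such disentangled conditioning signals could be wired into a synthesis decoder, with separate embedding tables per attribute and per-frame F0/energy scalars.

```python
# Minimal, hypothetical sketch of disentangled conditioning on speaker,
# accent, language, and per-frame F0/energy. Not the VANI/RADMMM implementation.
import torch
import torch.nn as nn

class ConditionedDecoder(nn.Module):
    def __init__(self, n_speakers, n_accents, n_languages,
                 text_dim=256, cond_dim=64, mel_dim=80):
        super().__init__()
        # Separate embedding tables keep speaker identity, accent, and
        # language as independent conditioning signals.
        self.speaker_emb = nn.Embedding(n_speakers, cond_dim)
        self.accent_emb = nn.Embedding(n_accents, cond_dim)
        self.language_emb = nn.Embedding(n_languages, cond_dim)
        # F0 and energy enter as two per-frame scalars.
        in_dim = text_dim + 3 * cond_dim + 2
        self.decoder = nn.GRU(in_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, mel_dim)

    def forward(self, text_enc, speaker, accent, language, f0, energy):
        # text_enc: (B, T, text_dim); f0, energy: (B, T)
        B, T, _ = text_enc.shape
        cond = torch.cat([self.speaker_emb(speaker),
                          self.accent_emb(accent),
                          self.language_emb(language)], dim=-1)  # (B, 3*cond_dim)
        cond = cond.unsqueeze(1).expand(B, T, -1)                # broadcast over frames
        x = torch.cat([text_enc, cond,
                       f0.unsqueeze(-1), energy.unsqueeze(-1)], dim=-1)
        out, _ = self.decoder(x)
        return self.proj(out)                                    # predicted mel frames

# Example: keep the speaker's voice while switching to a target language
# and that language's native accent (indices here are arbitrary).
model = ConditionedDecoder(n_speakers=6, n_accents=3, n_languages=3)
text_enc = torch.randn(1, 100, 256)
f0, energy = torch.rand(1, 100), torch.rand(1, 100)
mel = model(text_enc,
            speaker=torch.tensor([2]),    # original speaker identity
            accent=torch.tensor([1]),     # target language's native accent
            language=torch.tensor([1]),
            f0=f0, energy=energy)
```

Keeping each attribute in its own embedding table is what allows the combination to be swapped at inference time, e.g. pairing a speaker with a language and accent never seen together in training.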