Lightweight Prosody-TTS for multi-lingual multi-speaker scenario
Giridhar Pamisetty (IIT Hyderabad); Chaitanya Varun Sahukari (IIT Hyderabad); Sri Rama Murty Kodukula (IIT Hyderabad)
This work presents a lightweight end-to-end text-to-speech (TTS) synthesis system for the multi-lingual multi-speaker (ML-MS) scenario. The proposed system uses a non-autoregressive modular architecture with interconnected subnets for the text encoder, duration estimator, f0 estimator, and acoustic decoder. The text encoder is conditioned on language embeddings, while the duration and f0 estimators are conditioned on speaker embeddings. All the subnets are optimized end to end using a loss accumulated across the modules. The intermediate auxiliary loss functions help the model capture speech information effectively with less data. The proposed model achieved a mean opinion score (MOS) of 4.40 and a speaker similarity score of 3.8 with just 4.89 million (M) parameters in the LIMMITS grand challenge organized as part of ICASSP-23.
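The idea of accumulating losses across modules can be illustrated with a minimal sketch. It assumes one auxiliary loss each for the duration and f0 estimators plus a loss on the acoustic decoder output, all computed as mean squared error and combined with equal weights. The abstract does not specify the loss functions or their weights, so `mse`, `LossWeights`, and `accumulated_loss` are hypothetical names chosen for illustration only, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass(frozen=True)
class LossWeights:
    # Assumption: equal weighting. The abstract gives no weights.
    duration: float = 1.0
    f0: float = 1.0
    acoustic: float = 1.0


def accumulated_loss(
    dur_pred: Sequence[float], dur_true: Sequence[float],
    f0_pred: Sequence[float], f0_true: Sequence[float],
    ac_pred: Sequence[float], ac_true: Sequence[float],
    weights: Optional[LossWeights] = None,
) -> Tuple[float, Dict[str, float]]:
    """Return the weighted sum of per-module losses and the individual terms.

    Each subnet contributes its own term, so a single scalar can drive
    the optimization of all modules together.
    """
    w = weights if weights is not None else LossWeights()
    parts = {
        "duration": mse(dur_pred, dur_true),   # auxiliary loss (assumed MSE)
        "f0": mse(f0_pred, f0_true),           # auxiliary loss (assumed MSE)
        "acoustic": mse(ac_pred, ac_true),     # decoder output loss (assumed MSE)
    }
    total = (
        w.duration * parts["duration"]
        + w.f0 * parts["f0"]
        + w.acoustic * parts["acoustic"]
    )
    return total, parts


if __name__ == "__main__":
    total, parts = accumulated_loss(
        dur_pred=[1.0, 2.0], dur_true=[1.0, 4.0],
        f0_pred=[100.0], f0_true=[110.0],
        ac_pred=[0.5, 0.5], ac_true=[0.5, 0.5],
    )
    print(total, parts)
```

In a real training setup the same summation would be applied to differentiable tensors so that gradients from the combined scalar reach every subnet; the plain-float version here only shows how the terms are combined.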