HOW TO PUSH THE FASTEST MODEL 50X FASTER: STREAMING NON-AUTOREGRESSIVE SPEECH SYNTHESIS ON RESOURCE-LIMITED DEVICES

Thinh Van Nguyen (VinBigdata); Cuong H Pham (VinBigdata JSC); Dang-Khoa MAC (VinBigdata)

SPS

07 Jun 2023

Minimizing the latency of text-to-speech systems on resource-limited end-user devices is one of the top demands of voice-based human-machine interaction applications. This paper proposes the FastStreamSpeech model, which combines the advantages of advanced approaches in neural speech synthesis. The proposed method includes (1) a specific streaming inference architecture for the non-autoregressive acoustic and vocoder models and (2) an attention masking mechanism in the training phase. Experimental evaluations of the proposed model on a budget mobile CPU show that it can reduce system latency by a factor of 5 to 50 compared with the original model while maintaining high-quality output speech.
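The abstract does not say how the attention mask or the streaming chunks are constructed, so the following is only a generic sketch of one common way such ideas are combined, not the authors' implementation. It assumes that each position is restricted to a fixed window of neighbours during training, so that chunk-by-chunk inference sees the same limited context. The function names, the window sizes, and the chunk size are all hypothetical.

```python
# Illustrative sketch only: names and parameters are assumptions,
# not taken from the FastStreamSpeech paper.

def windowed_attention_mask(num_positions, left_context, right_context):
    """Build a boolean mask; mask[i][j] is True when position i may attend to j.

    Each position sees at most `left_context` positions before it and
    `right_context` positions after it.
    """
    return [
        [(-left_context <= j - i <= right_context) for j in range(num_positions)]
        for i in range(num_positions)
    ]


def chunk_spans(num_positions, chunk_size):
    """Split a sequence into consecutive (start, end) spans for streaming."""
    return [
        (start, min(start + chunk_size, num_positions))
        for start in range(0, num_positions, chunk_size)
    ]


if __name__ == "__main__":
    mask = windowed_attention_mask(num_positions=5, left_context=1, right_context=1)
    for row in mask:
        # Print 1 where attention is allowed and 0 where it is masked out.
        print("".join("1" if allowed else "0" for allowed in row))
    print(chunk_spans(num_positions=10, chunk_size=4))
```

In a sketch like this, the window sizes would bound how much context each chunk needs before synthesis can start, which is the kind of trade-off a streaming design has to make between latency and output quality.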

Tags:

Speech production, perception and psychoacoustics