VF-TACO2: TOWARDS FAST AND LIGHTWEIGHT SYNTHESIS FOR AUTOREGRESSIVE MODELS WITH VARIATION AUTOENCODER AND FEATURE DISTILLATION
Yuhao Liu (Tianjin University); Cheng Gong (Tianjin University); Longbiao Wang (Tianjin University); Xixin Wu (The Chinese University of Hong Kong); Qiuyu Liu (Tianjin University); Jianwu Dang (Tianjin University)
With the development of deep learning, end-to-end neural text-to-speech (TTS) systems have achieved significant improvements in high-quality speech synthesis. However, most of these systems are attention-based autoregressive models, which suffer from slow synthesis speed and large model parameter counts. In this paper, we propose a new fast and lightweight TTS framework named VF-Taco2, which can quickly synthesize speech without GPUs. We first profile the complexity of the decoder process in a current autoregressive model and design a novel multiple-frames prediction module based on a variational autoencoder (VAE) to alleviate the quality degradation that occurs when a larger "reduction factor" is applied. In addition, feature distillation is leveraged to compress a relatively large version of the proposed model into a smaller one with only a minor loss of speech quality. Compared to the original Tacotron 2, VF-Taco2 achieves a 3.6x-4.4x Mel-spectrogram generation speedup on CPUs with different performance levels, and its parameters are compressed by 1.5x while speech quality is maintained.
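The speedup mechanism the abstract describes can be illustrated with a minimal sketch (not the paper's code; the function names and frame counts below are hypothetical). With a reduction factor r, an autoregressive decoder emits r Mel frames per step, so the number of sequential decoder steps, the dominant cost on CPU, shrinks roughly by a factor of r. A simple mean-squared-error term of the kind commonly used for feature distillation between teacher and student hidden features is also sketched.

```python
import math

def decoder_steps(total_frames: int, reduction_factor: int) -> int:
    """Autoregressive steps needed to emit `total_frames` Mel frames
    when each step predicts `reduction_factor` frames at once."""
    return math.ceil(total_frames / reduction_factor)

def feature_distillation_loss(teacher_feats, student_feats) -> float:
    """Mean squared error between teacher and student feature vectors,
    a common form of feature-distillation objective (illustrative only)."""
    assert len(teacher_feats) == len(student_feats)
    n = len(teacher_feats)
    return sum((t - s) ** 2 for t, s in zip(teacher_feats, student_feats)) / n

# Example: a ~3-second utterance at ~80 Mel frames per second.
frames = 240
for r in (1, 2, 3):
    print(f"reduction factor {r}: {decoder_steps(frames, r)} decoder steps")
```

Under these assumptions, raising r from 1 to 3 cuts the sequential steps from 240 to 80, which is where most of the CPU-side acceleration comes from; the VAE-based multiple-frames prediction module in the paper exists precisely to keep quality from degrading at such larger r.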