IMPROVING FASTSPEECH TTS WITH EFFICIENT SELF-ATTENTION AND COMPACT FEED-FORWARD NETWORK
Yujia Xiao, Xi Wang, Lei He, Frank K. Soong
SPS
FastSpeech, a feed-forward Transformer-based TTS model, avoids slow, serial autoregressive inference by generating the target mel-spectrogram in parallel. As a non-autoregressive TTS model, its inference latency and computation load shift from the vocoder to the Transformer, whose efficiency is limited by the quadratic time and memory complexity of the self-attention mechanism, particularly for long text sequences. To tackle these challenges, we propose two models, ProbSparseFS and LinearizedFS, which use efficient self-attention mechanisms to improve inference speed and memory complexity. LinearizedFS achieves 3.4x memory savings and a 2.1x inference speedup compared with the baseline FastSpeech. A further optimized LinearizedFS with a lightweight FFN accelerates inference by a further 3.6x. We conduct subjective voice quality evaluations in MOS and CMOS on news report and audiobook applications, covering multi-speaker and multi-style scenarios. Test results verify that the proposed models yield TTS quality on par with that of the baseline system, but with much better memory efficiency and inference speed.
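To illustrate the efficiency gap the abstract refers to, the following is a minimal sketch (not the authors' implementation) of linearized self-attention, assuming the positive kernel feature map phi(x) = elu(x) + 1 from Katharopoulos et al. (2020). Standard softmax attention materializes an n-by-n matrix, costing O(n^2) in sequence length n; a factorable kernel lets us precompute phi(K)^T V once, giving O(n) time and memory.

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = elu(x) + 1: keeps features positive so the
    # per-query normalizer below stays strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linearized_attention(Q, K, V, eps=1e-6):
    # Q, K: (n, d); V: (n, d_v).
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)
    KV = Kf.T @ V                 # (d, d_v): built once, O(n * d * d_v)
    Z = Qf @ Kf.sum(axis=0)       # (n,): normalizer, no (n, n) matrix needed
    return (Qf @ KV) / (Z[:, None] + eps)

def softmax_attention(Q, K, V):
    # Quadratic baseline: explicitly forms the (n, n) attention matrix.
    S = Q @ K.T
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.normal(size=(3, n, d))
out = linearized_attention(Q, K, V)
print(out.shape)  # (128, 16)
```

Both functions map (n, d) inputs to (n, d) outputs, but only the softmax baseline allocates O(n^2) intermediate state; the kernelized form is the kind of rearrangement that yields the memory savings and speedup reported for LinearizedFS.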