LEARNING TO AUTO-CORRECT FOR HIGH-QUALITY SPECTROGRAMS
Zhiyang Zhou (Beijing Bombax XiaoIce Technology Co., Ltd); Shihui Liu (Beijing Bombax XiaoIce Technology Co., Ltd)
SPS
Non-autoregressive text-to-speech (TTS) has achieved impressive inference speedup, but at the cost of inferior voice quality. The fundamental reason lies in the gap between the complexity of data distributions and the capability of modeling methods. Previous works alleviate the problem by either simplifying data distributions or enhancing modeling methods. In this work, we propose a new architecture, ReActSpeech, that explicitly learns to "auto-correct" for high-quality spectrograms. Specifically, ReActSpeech uses a redistribution module to automatically improve (correct) the extracted alignments, and an iterative decoder, called the revisor, to iteratively refine (correct) the spectrograms. Extensive experiments on several benchmarks show that ReActSpeech greatly alleviates this problem and achieves a favorable trade-off among training time, inference speed, and output quality.