A New High Quality Trajectory Tiling Based Hybrid TTS In Real Time
Feng-Long Xie, Xin-Hui Li, Wen-Chao Su, Li Lu, Frank K. Soong
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 00:05:37
A trajectory tiling based hybrid TTS is revisited in this study to improve its synthesis performance. An architecture combining a Transformer encoder with an RNN-based decoder exploits a two-level linguistic representation, at both the word level and the Chinese phonetic alphabet letter level, to generate a cogent and smooth speech parameter trajectory. A segment candidate lattice is then constructed by minimizing the log spectral distortion of mel-spectrograms and the RMSE of F0 between the generated trajectory and the candidates. Normalized cross-correlation is used to find the best sequence of “waveform tiles” in the lattice for synthesizing the final speech waveforms. Subjective A/B preference tests show that the new hybrid system outperforms our earlier trajectory-tiling hybrid baseline TTS (67% vs 11%) and the state-of-the-art, real-time TTS system constructed with Tacotron 2 and LPCNet (56% vs 27%).
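The three measures named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the equal-length and time-aligned inputs, the weighting parameter `w_f0`, and the use of normalized cross-correlation at a single joint between two tiles are all assumptions made for illustration. The abstract does not specify how these quantities are combined or how the lattice is searched.

```python
import numpy as np

def log_spectral_distortion(log_mel_a, log_mel_b):
    """Mean over frames of the RMS difference between two log-mel spectrograms.

    Both inputs are (frames, bins) arrays, assumed already in the log domain
    and time-aligned (an assumption of this sketch).
    """
    diff = np.asarray(log_mel_a, dtype=float) - np.asarray(log_mel_b, dtype=float)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

def f0_rmse(f0_a, f0_b):
    """Root-mean-square error between two equal-length F0 contours."""
    diff = np.asarray(f0_a, dtype=float) - np.asarray(f0_b, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def target_cost(traj_mel, traj_f0, cand_mel, cand_f0, w_f0=1.0):
    """Hypothetical combined cost for ranking one candidate against the
    generated trajectory; the weight w_f0 is an assumption of this sketch."""
    return log_spectral_distortion(traj_mel, cand_mel) + w_f0 * f0_rmse(traj_f0, cand_f0)

def best_ncc_lag(prev_tail, next_head, max_lag):
    """Find the lag in [0, max_lag] at which next_head best matches prev_tail
    under normalized cross-correlation. Returns (lag, score)."""
    prev_tail = np.asarray(prev_tail, dtype=float)
    next_head = np.asarray(next_head, dtype=float)
    n = len(prev_tail)
    best_lag, best_score = 0, -np.inf
    for lag in range(max_lag + 1):
        window = next_head[lag:lag + n]
        if len(window) < n:  # ran out of samples in the next tile
            break
        denom = np.linalg.norm(prev_tail) * np.linalg.norm(window)
        score = float(np.dot(prev_tail, window) / denom) if denom > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```

In this sketch, a lower `target_cost` would mark a candidate segment as closer to the generated trajectory, and a higher score from `best_ncc_lag` would indicate a smoother joint between two consecutive waveform tiles.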
Chairs:
Yu Zhang