AUTOREGRESSIVE VARIATIONAL AUTOENCODER WITH A HIDDEN SEMI-MARKOV MODEL-BASED STRUCTURED ATTENTION FOR SPEECH SYNTHESIS
Takato Fujimoto, Kei Hashimoto, Yoshihiko Nankaku, Keiichi Tokuda
-
SPS
IEEE Members: $11.00
Non-members: $15.00Length: 00:09:34
This paper proposes an autoregressive speech synthesis model based on the variational autoencoder incorporating latent sequence representation for acoustic and linguistic features and the structure of a hidden semi-Markov model (HSMM). Although autoregressive models can provide efficient and accurate modeling of acoustic features, they have exposure bias, i.e., the mismatch between training (teacher-forcing) and inference (free-running). To overcome this problem, we introduce an autoregressive latent variable sequence, rather than using autoregressive generation of observations. Latent representation of alignment using HSMM-based structured attention mechanism enables the use of a completely consistent training algorithm for acoustic modeling with explicit duration models. Experimental results indicate that the proposed model outperformed baselines in subjective naturalness.