Extracting Unit Embeddings Using Sequence-To-Sequence Acoustic Models For Unit Selection Speech Synthesis
Xiao Zhou, Zhen-Hua Ling, Li-Rong Dai
This paper presents a method of using the intermediate representations between linguistic and acoustic features in a Tacotron model to derive the cost functions for unit selection speech synthesis. By extracting the outputs of the Tacotron encoder, each phone-sized candidate unit in the corpus is represented by a fixed-length unit vector. Similarly, each target unit to be synthesized is converted into a unit vector of the same dimension by encoding the input phone sequence. The normalized Euclidean distances between target and candidate unit vectors are used to perform unit pre-selection and to calculate the target cost for unit selection. Then, another DNN, which predicts the unit vector of each phone from those of its preceding phones, is constructed to derive the concatenation cost function. Experimental results demonstrate that the unit vectors extracted from Tacotron contain both duration and acoustic information of phone units. Compared with our previous work, which learned unit vectors using a DNN trained on acoustic features alone, the method proposed in this paper further improves the naturalness of unit selection speech synthesis in our experiments.
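To illustrate how such unit vectors might drive the two cost functions, the sketch below computes per-dimension normalized Euclidean distances for pre-selection and target cost, and a concatenation cost against a predicted next unit vector. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the pre-selection size k, and the `predictor` callable standing in for the paper's DNN are all illustrative.

```python
import numpy as np

def normalized_distance(u, v, std):
    # Euclidean distance with each dimension scaled by a per-dimension
    # standard deviation (an assumed normalization scheme).
    return float(np.sqrt(np.sum(((u - v) / std) ** 2)))

def preselect_candidates(target_vec, candidate_vecs, std, k=50):
    # Keep the k candidate units whose unit vectors lie closest to the
    # target unit vector; k=50 is an illustrative choice.
    dists = np.sqrt(np.sum(((candidate_vecs - target_vec) / std) ** 2, axis=1))
    keep = np.argsort(dists)[:k]
    return keep, dists[keep]

def target_cost(target_vec, candidate_vec, std):
    # Target cost: distance between target and candidate unit vectors.
    return normalized_distance(target_vec, candidate_vec, std)

def concatenation_cost(predictor, preceding_vecs, candidate_vec, std):
    # Concatenation cost: distance between the candidate's unit vector and
    # the vector predicted from the preceding (already selected) units.
    # `predictor` is a plain callable standing in for the paper's DNN.
    predicted = predictor(preceding_vecs)
    return normalized_distance(predicted, candidate_vec, std)

# Toy usage with random stand-ins for Tacotron encoder outputs.
rng = np.random.default_rng(0)
dim, n_candidates = 256, 1000
candidates = rng.normal(size=(n_candidates, dim))  # candidate unit vectors
target = rng.normal(size=dim)                      # target unit vector
std = candidates.std(axis=0) + 1e-8                # per-dimension scale

keep, _ = preselect_candidates(target, candidates, std, k=20)
tc = target_cost(target, candidates[keep[0]], std)
cc = concatenation_cost(lambda prev: prev[-1],     # trivial stand-in predictor
                        candidates[keep[:2]], candidates[keep[0]], std)
print(f"pre-selected {len(keep)} units; target cost {tc:.3f}, concat cost {cc:.3f}")
```

In an actual unit selection pipeline, these two costs would be combined in a Viterbi-style lattice search over the pre-selected candidates; the sketch only shows the cost computations themselves.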