Extracting Unit Embeddings Using Sequence-To-Sequence Acoustic Models For Unit Selection Speech Synthesis
Xiao Zhou, Zhen-Hua Ling, Li-Rong Dai
This paper presents a method of using the intermediate representations between linguistic and acoustic features in a Tacotron model to derive the cost functions for unit selection speech synthesis. By extracting the outputs of the Tacotron encoder, each phone-sized candidate unit in the corpus is represented by a fixed-length unit vector. Similarly, each target unit to be synthesized is converted into a unit vector of the same dimension by encoding the input phone sequence. The normalized Euclidean distances between target and candidate unit vectors are used to perform unit pre-selection and to calculate the target cost for unit selection. Then, another DNN, which predicts the unit vector of each phone from those of its preceding phones, is constructed to derive the concatenation cost function. Experimental results demonstrate that the unit vectors extracted from Tacotron contain both duration and acoustic information of phone units. Compared with our previous work, which learned unit vectors using a DNN trained on acoustic features alone, the method proposed in this paper further improves the naturalness of unit selection speech synthesis in our experiments.
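To illustrate how such unit vectors might drive the two cost functions, the sketch below computes per-dimension normalized Euclidean distances for pre-selection and target cost, and a concatenation cost against a predicted next unit vector. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the pre-selection size k, and the `predictor` callable standing in for the paper's DNN are all illustrative.

```python
import numpy as np

def normalized_distance(u, v, std):
    # Euclidean distance with each dimension scaled by a per-dimension
    # standard deviation (an assumed normalization scheme).
    return float(np.sqrt(np.sum(((u - v) / std) ** 2)))

def preselect_candidates(target_vec, candidate_vecs, std, k=50):
    # Keep the k candidate units whose unit vectors lie closest to the
    # target unit vector; k=50 is an illustrative choice.
    dists = np.sqrt(np.sum(((candidate_vecs - target_vec) / std) ** 2, axis=1))
    keep = np.argsort(dists)[:k]
    return keep, dists[keep]

def target_cost(target_vec, candidate_vec, std):
    # Target cost: distance between target and candidate unit vectors.
    return normalized_distance(target_vec, candidate_vec, std)

def concatenation_cost(predictor, preceding_vecs, candidate_vec, std):
    # Concatenation cost: distance between the candidate's unit vector and
    # the vector predicted from the preceding (already selected) units.
    # `predictor` is a plain callable standing in for the paper's DNN.
    predicted = predictor(preceding_vecs)
    return normalized_distance(predicted, candidate_vec, std)

# Toy usage with random stand-ins for Tacotron encoder outputs.
rng = np.random.default_rng(0)
dim, n_candidates = 256, 1000
candidates = rng.normal(size=(n_candidates, dim))  # candidate unit vectors
target = rng.normal(size=dim)                      # target unit vector
std = candidates.std(axis=0) + 1e-8                # per-dimension scale

keep, _ = preselect_candidates(target, candidates, std, k=20)
tc = target_cost(target, candidates[keep[0]], std)
cc = concatenation_cost(lambda prev: prev[-1],     # trivial stand-in predictor
                        candidates[keep[:2]], candidates[keep[0]], std)
print(f"pre-selected {len(keep)} units; target cost {tc:.3f}, concat cost {cc:.3f}")
```

In an actual unit selection pipeline, these two costs would be combined in a Viterbi-style lattice search over the pre-selected candidates; the sketch only shows the cost computations themselves.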