    Length: 11:16
04 May 2020

This paper presents a method that uses the intermediate representations between linguistic and acoustic features in a Tacotron model to derive the cost functions for unit selection speech synthesis. By extracting the outputs of the Tacotron encoder, each phone-sized candidate unit in the corpus is represented by a fixed-length unit vector. Each target unit to be synthesized is likewise converted into a unit vector of the same dimension by encoding the input phone sequence. The normalized Euclidean distance between a target unit vector and a candidate unit vector is used both for unit pre-selection and to calculate the target cost for unit selection. A separate DNN, which predicts the unit vector of each phone from those of its preceding phones, is then constructed to derive the concatenation cost function. Experimental results demonstrate that the unit vectors extracted from Tacotron contain both duration and acoustic information about phone units. Compared with our previous work, which learned unit vectors using a DNN and only acoustic features, the proposed method further improves the naturalness of unit selection speech synthesis in our experiments.
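
The pre-selection and target-cost step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the Euclidean distance is normalized, so the code assumes each dimension is divided by its standard deviation over the corpus. The function names (`normalized_euclidean`, `preselect`), the vector dimension, and the random toy vectors standing in for Tacotron encoder outputs are all hypothetical.

```python
import numpy as np

def normalized_euclidean(target, candidates, scale):
    """Distance between one target vector and each candidate vector.

    Each dimension is divided by `scale` before the Euclidean norm is taken.
    Using the per-dimension standard deviation as `scale` is an assumption;
    the abstract does not specify the normalization.
    """
    diff = (candidates - target) / scale
    return np.sqrt(np.sum(diff ** 2, axis=1))

def preselect(target, candidates, scale, k):
    """Return indices and distances of the k nearest candidates, nearest first."""
    dist = normalized_euclidean(target, candidates, scale)
    order = np.argsort(dist)[:k]
    return order, dist[order]

# Toy data standing in for encoder-derived unit vectors (random, illustrative only).
rng = np.random.default_rng(0)
candidates = rng.normal(size=(50, 8))   # 50 hypothetical candidate units, 8 dims
scale = candidates.std(axis=0)          # assumed normalization factors
target = candidates[7] + 0.01 * rng.normal(size=8)  # target close to candidate 7

# The surviving candidates and their distances; the distances can double as
# the target cost in this sketch.
idx, cost = preselect(target, candidates, scale, k=5)
```

In this sketch the same distance serves both purposes named in the abstract: the `k` smallest distances select the candidate shortlist, and each shortlisted distance is reused as that candidate's target cost.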
