04 May 2020

Articulation-to-speech synthesis based solely on supraglottal articulation requires some form of intonation control. This paper examines to what extent the f0 contour of an utterance can be predicted from such supraglottal articulation data. To that end, three groups of machine learning models (support vector machines, kernel ridge regression, and neural networks) were trained and evaluated on the mngu0 speech corpus, which contains synchronous articulatory and audio data. The best voiced/unvoiced/silence classification rates were achieved by a deep neural network with two hidden layers: 85.8% with no look-ahead (important for on-line applications) and 86% with a look-ahead of 50 ms. The best f0 prediction model without look-ahead achieved a root-mean-square error (RMSE) of 10.4 Hz relative to the original f0 contours, using a neural network with one hidden layer, while the best prediction with a look-ahead of 50 ms was attained by kernel ridge regression, with an RMSE of 10.3 Hz. The predicted f0 contours were also evaluated subjectively in a listening test in which the f0 of the original speech files was manipulated using PRAAT; the results are consistent with the objective evaluation.
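As a rough illustration of the two tasks the abstract describes (frame-level voiced/unvoiced/silence classification and f0 regression, each scored against reference contours), here is a minimal scikit-learn sketch. It is not the authors' code: the stand-in feature matrix, hidden-layer sizes, kernel hyperparameters, and train/test split are all placeholder assumptions, and loading real mngu0 EMA features would replace the synthetic data.

```python
# Minimal sketch of the abstract's pipeline (NOT the authors' implementation).
# Synthetic arrays stand in for mngu0 articulatory frames; all shapes and
# hyperparameters below are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import accuracy_score, mean_squared_error

rng = np.random.default_rng(0)

# Stand-in articulatory features: n frames x d channels (hypothetical sizes).
# A look-ahead of 50 ms could be emulated by stacking future frames onto each
# frame's feature vector before training (an assumption, not stated verbatim).
X = rng.normal(size=(2000, 24))

# Task 1: voiced / unvoiced / silence classification (labels 0/1/2),
# using a network with two hidden layers, as in the paper's best classifier.
y_vus = rng.integers(0, 3, size=2000)
clf = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=300)
clf.fit(X[:1500], y_vus[:1500])
print("V/UV/S accuracy:", accuracy_score(y_vus[1500:], clf.predict(X[1500:])))

# Task 2: f0 regression via kernel ridge regression, scored by RMSE in Hz
# against reference f0 values (synthetic targets here).
y_f0 = rng.uniform(80.0, 300.0, size=2000)
krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1)
krr.fit(X[:1500], y_f0[:1500])
rmse = np.sqrt(mean_squared_error(y_f0[1500:], krr.predict(X[1500:])))
print(f"f0 RMSE: {rmse:.1f} Hz")
```

On real data, the RMSE would be computed only on frames classified as voiced, since f0 is undefined elsewhere; that filtering step is omitted here for brevity.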
