Prediction Of Voicing And The F0 Contour From Electromagnetic Articulography Data For Articulation-To-Speech Synthesis
Simon Stone, Philipp Schmidt, Peter Birkholz
SPS
Length: 14:37
Articulation-to-speech synthesis based solely on supraglottal articulation requires some form of intonation control. This paper examines to what extent the f0 contour of an utterance can be predicted from such supraglottal articulation data. To that end, three groups of machine learning models (support vector machines, kernel ridge regression, and neural networks) were trained and evaluated on the mngu0 speech corpus, which contains synchronous articulatory and audio data. The best voiced/unvoiced/silence classification rates were achieved by a deep neural network with two hidden layers: 85.8 % with no look-ahead (important for on-line applications) and 86 % with a look-ahead of 50 ms. The best f0 prediction model without look-ahead, a neural network with one hidden layer, achieved a root-mean-square error (RMSE) of 10.4 Hz with respect to the original f0 contours, while the best prediction with a look-ahead of 50 ms was attained by kernel ridge regression with an RMSE of 10.3 Hz. The predicted f0 contours were also evaluated subjectively in a listening test, for which the f0 of the original speech files was manipulated using PRAAT. The results are consistent with the objective evaluation.
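As a rough illustration of the kernel-ridge f0-prediction setup described above, the sketch below fits a closed-form RBF kernel ridge regressor to synthetic stand-in data and reports the RMSE in Hz. The actual paper trains on EMA features from the mngu0 corpus; the feature dimensionality, kernel, hyperparameters, and data here are illustrative assumptions, not the authors' configuration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    # Pairwise RBF kernel: exp(-gamma * ||a - b||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, alpha=1.0, gamma=0.1):
    # Closed-form dual solution: (K + alpha*I)^-1 y
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, dual, X_new, gamma=0.1):
    return rbf_kernel(X_new, X_train, gamma) @ dual

# Hypothetical stand-in data: 300 frames of 12 "articulatory" features
# and a synthetic f0 target in Hz (the real model uses EMA trajectories).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
w = rng.normal(size=12)
f0 = 120 + 20 * np.tanh(X @ w)

# Split along the time axis into train and held-out frames.
X_tr, X_te = X[:250], X[250:]
y_tr, y_te = f0[:250], f0[250:]

dual = kernel_ridge_fit(X_tr, y_tr)
pred = kernel_ridge_predict(X_tr, dual, X_te)

# RMSE in Hz, the objective measure used in the paper.
rmse = np.sqrt(np.mean((pred - y_te) ** 2))
print(f"RMSE: {rmse:.2f} Hz")
```

On real data, the look-ahead variants reported above would simply append future articulatory frames (e.g. +50 ms) to each input feature vector before fitting.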