MOS Predictor for Synthetic Speech with I-vector Inputs
Miao Liu, Jing Wang, Shicong Li, Fei Xiang, Yue Yao, Lidong Yang
Based on deep learning technology, non-intrusive methods have received increasing attention for synthetic speech quality assessment because they do not require reference signals. Meanwhile, the i-vector has been widely used in paralinguistic speech attribute recognition, such as speaker and emotion recognition, but few studies have used it to estimate speech quality. In this paper, we propose a neural-network-based model that splices the deep features extracted by a convolutional neural network (CNN) and the i-vector along the time axis and uses a Transformer encoder as the time sequence model. To evaluate the proposed method, we improve previous prediction models and conduct experiments on the Voice Conversion Challenge (VCC) 2018 and 2016 datasets. Results show that the i-vector contains information closely related to the quality of synthetic speech, and the proposed models that utilize the i-vector and the Transformer encoder substantially improve the accuracy of MOSNet and MBNet at both the utterance level and the system level.
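To make the described architecture concrete, the following PyTorch sketch shows one plausible reading of the abstract: CNN frame features and a projected i-vector are spliced along the time axis and passed through a Transformer encoder, with a frame-level regression head averaged into an utterance-level MOS (as in MOSNet-style predictors). This is a minimal illustration, not the authors' released code; the class and layer names (IVectorMOSPredictor, frame_proj, ivec_proj) and all dimensions are assumptions, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class IVectorMOSPredictor(nn.Module):
    """Hypothetical sketch: splice CNN frame features and an i-vector
    token on the time axis, encode with a Transformer encoder, and
    regress frame-level quality scores (averaged to utterance MOS)."""

    def __init__(self, ivec_dim=400, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # CNN over the magnitude spectrogram; stride 4 on the frequency
        # axis only, so the time resolution is preserved.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=(1, 4), padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=(1, 4), padding=1), nn.ReLU(),
        )
        # LazyLinear infers the flattened CNN feature size on first use.
        self.frame_proj = nn.LazyLinear(d_model)
        # Project the utterance-level i-vector to one extra "frame".
        self.ivec_proj = nn.Linear(ivec_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.head = nn.Linear(d_model, 1)  # frame-level score

    def forward(self, spec, ivec):
        # spec: (B, T, F) magnitude spectrogram; ivec: (B, ivec_dim)
        h = self.cnn(spec.unsqueeze(1))              # (B, C, T, F')
        B, C, T, Fp = h.shape
        h = h.permute(0, 2, 1, 3).reshape(B, T, C * Fp)
        frames = self.frame_proj(h)                  # (B, T, d_model)
        ivec_tok = self.ivec_proj(ivec).unsqueeze(1) # (B, 1, d_model)
        # Splice the i-vector onto the time axis as an extra token.
        seq = torch.cat([ivec_tok, frames], dim=1)   # (B, T+1, d_model)
        enc = self.encoder(seq)
        frame_scores = self.head(enc[:, 1:]).squeeze(-1)  # (B, T)
        return frame_scores.mean(dim=1), frame_scores

# Example forward pass with random inputs (shapes are illustrative):
model = IVectorMOSPredictor()
spec = torch.randn(8, 300, 257)   # batch of 8 utterances, 300 frames
ivec = torch.randn(8, 400)        # one 400-dim i-vector per utterance
utt_mos, frame_mos = model(spec, ivec)
```

Under this reading, training would follow the MOSNet/MBNet recipe cited in the abstract, e.g. a mean-squared-error loss between the utterance-level prediction and the ground-truth MOS, optionally combined with a frame-level loss against the same target.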