An Improved Frame-Unit-Selection Based Voice Conversion System Without Parallel Training Data
Feng-Long Xie, Xin-Hui Li, Bo Liu, Frank K. Soong, Yi-Bin Zheng, Li Meng, Li Lu
Length: 10:34
We revisit our earlier frame-unit-selection based voice conversion system to improve both speech naturalness and speaker similarity. Phonetic information is represented by the frame-level phone posterior probability (PPP) output by a speaker-independent, bilingual (Mandarin Chinese and American English) deep neural network (DNN) acoustic model, and prosodic information by the corresponding frame-level F0. Target frame candidates are selected to construct a search lattice using two distortion measures: the Kullback-Leibler divergence (KLD) between source and target PPPs (phonetic distortion) and the absolute difference between normalized source and target F0 (prosodic distortion). The optimal target unit trajectory is then found with the Viterbi algorithm, which seeks to minimize the dynamic acoustic difference between the acoustic trajectory of the source speech and that of the target candidates. The resulting spectral trajectory, together with the enhanced pitch period and pitch correlation trajectories, is sent to the LPCNet vocoder to synthesize the converted waveforms. Compared with the top-ranked system in the Voice Conversion Challenge 2018, the new system achieves on-par performance on a studio-to-studio American English VC test and better performance on a non-studio-to-studio Mandarin Chinese VC test, in both speech naturalness (MOS) and speaker similarity (DMOS).
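The candidate selection and lattice search described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the function names, the one-directional form of the KLD, the weight `w_f0`, the number of candidates `n_cand`, and the definition of the "dynamic acoustic difference" as the Euclidean distance between source and target frame-to-frame deltas. F0 values are assumed to be normalized already.

```python
import numpy as np

def kld(p, q, eps=1e-10):
    """KL divergence D(p || q) between one source PPP p (K,) and target PPPs q (M, K).

    Assumption: the one-directional form is used; the abstract does not say
    whether the divergence is symmetrized.
    """
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def select_candidates(src_ppp, src_f0, tgt_ppp, tgt_f0, n_cand=3, w_f0=1.0):
    """Keep, for each source frame, the n_cand target frames with the lowest
    phonetic (KLD) plus prosodic (|F0 difference|) distortion.

    n_cand and w_f0 are hypothetical parameters, not values from the paper.
    Returns candidate indices and their costs, both of shape (T, n_cand).
    """
    cands, costs = [], []
    for p, f in zip(src_ppp, src_f0):
        cost = kld(p, tgt_ppp) + w_f0 * np.abs(f - tgt_f0)
        idx = np.argsort(cost)[:n_cand]
        cands.append(idx)
        costs.append(cost[idx])
    return np.array(cands), np.array(costs)

def viterbi(cands, target_costs, src_feat, tgt_feat):
    """Find the lowest-cost path through the candidate lattice.

    Transition cost (an assumed definition): the distance between the source
    delta src_feat[t] - src_feat[t-1] and the delta between the two target
    candidates chosen at frames t-1 and t.
    """
    T, N = cands.shape
    acc = target_costs[0].copy()            # best accumulated cost per candidate
    back = np.zeros((T, N), dtype=int)      # back-pointers into the previous frame
    for t in range(1, T):
        src_delta = src_feat[t] - src_feat[t - 1]                  # (D,)
        cur = tgt_feat[cands[t]][:, None, :]                       # (N, 1, D)
        prev = tgt_feat[cands[t - 1]][None, :, :]                  # (1, N, D)
        trans = np.linalg.norm((cur - prev) - src_delta, axis=-1)  # (N_cur, N_prev)
        total = acc[None, :] + trans
        back[t] = np.argmin(total, axis=1)
        acc = total[np.arange(N), back[t]] + target_costs[t]
    # Trace back from the best final candidate.
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmin(acc)
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return cands[np.arange(T), path]        # selected target frame index per source frame
```

Given source and target PPP matrices, normalized F0 vectors, and acoustic feature matrices, calling `select_candidates` followed by `viterbi` returns one target frame index per source frame. In the system described here, the acoustic features of the selected frames would then be passed on to the vocoder stage, which this sketch does not cover.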