An Improved Frame-Unit-Selection Based Voice Conversion System Without Parallel Training Data

Feng-Long Xie, Xin-Hui Li, Bo Liu, Frank K. Soong, Yi-Bin Zheng, Li Meng, Li Lu

Length: 10:34
04 May 2020

We revisit a frame-unit-selection based voice conversion system that we proposed earlier, with the aim of improving both speech naturalness and speaker similarity. The frame-level phone posterior probability (PPP) output by a speaker-independent, bilingual (Mandarin Chinese and American English) deep neural network (DNN) acoustic model represents the phonetic information, and the corresponding frame-level F0 represents the prosodic information. Target frame candidates are selected by two measures to construct a search lattice: the Kullback-Leibler divergence (KLD) between source and target PPPs (phonetic distortion) and the absolute difference between normalized source and target F0 (prosodic distortion). The optimal target unit trajectory is then obtained with the Viterbi algorithm, which minimizes the dynamic acoustic difference between the acoustic trajectory of the source speech and that of the target candidates. The resulting spectral trajectory, together with the enhanced pitch period and pitch correlation trajectories, is sent to the LPCNet vocoder to synthesize the converted waveforms. Compared with the top-ranked system in the Voice Conversion Challenge 2018, our new system achieves on-par performance on a studio-to-studio American English VC test and better performance on a non-studio-to-studio Mandarin Chinese VC test, in both speech naturalness (MOS) and speaker similarity (DMOS).
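
The candidate selection and lattice search described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the KLD direction (source to target), the weights `w_f0` and `w_dyn`, the number of candidates `k`, the Euclidean form of the dynamic cost, and all function names. The F0 inputs are assumed to be already normalized per speaker.

```python
import numpy as np

def kld(p, q, eps=1e-10):
    """KL divergence D(p || q) between one posterior p and each row of q."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def select_candidates(src_ppp, src_f0, tgt_ppp, tgt_f0, k=3, w_f0=1.0):
    """For each source frame, keep the k target frames with the lowest
    phonetic (KLD) plus prosodic (absolute F0 difference) distortion."""
    T = src_ppp.shape[0]
    cand = np.zeros((T, k), dtype=int)
    cost = np.zeros((T, k))
    for t in range(T):
        c = kld(src_ppp[t], tgt_ppp) + w_f0 * np.abs(src_f0[t] - tgt_f0)
        idx = np.argsort(c)[:k]
        cand[t], cost[t] = idx, c[idx]
    return cand, cost

def viterbi(cand, cost, src_feat, tgt_feat, w_dyn=1.0):
    """Search the candidate lattice for the target frame sequence whose
    frame-to-frame change best matches that of the source (assumed cost)."""
    T, k = cand.shape
    acc = cost[0].copy()                    # accumulated cost per candidate
    back = np.zeros((T, k), dtype=int)      # back-pointers into frame t-1
    for t in range(1, T):
        src_delta = src_feat[t] - src_feat[t - 1]
        # tgt_delta[i, j]: change from candidate i at t-1 to candidate j at t
        tgt_delta = tgt_feat[cand[t]][None, :, :] - tgt_feat[cand[t - 1]][:, None, :]
        trans = w_dyn * np.linalg.norm(tgt_delta - src_delta, axis=-1)
        total = acc[:, None] + trans
        back[t] = np.argmin(total, axis=0)
        acc = total[back[t], np.arange(k)] + cost[t]
    # Trace back from the cheapest final candidate.
    path = np.zeros(T, dtype=int)
    j = int(np.argmin(acc))
    for t in range(T - 1, -1, -1):
        path[t] = cand[t, j]
        j = back[t, j]
    return path
```

Calling `select_candidates` and then `viterbi` returns one target frame index per source frame. In the system described above, the spectral features of the selected frames would then be passed to the vocoder together with the pitch parameters; that step is outside this sketch.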
