Towards Low-Resource StarGAN Voice Conversion Using Weight Adaptive Instance Normalization

Mingjie Chen, Yanpei Shi, Thomas Hain

Length: 00:05:55

08 Jun 2021

Many-to-many voice conversion with non-parallel training data has seen significant progress in recent years. It remains challenging because no ground-truth parallel data are available. StarGAN-based models have gained attention because of their efficiency and effectiveness. However, most StarGAN-based work has focused on a small number of speakers and a large amount of training data. In this work, we aim to improve the data efficiency of the model and to achieve many-to-many non-parallel StarGAN-based voice conversion for a relatively large number of speakers with limited training samples. To improve data efficiency, the proposed model uses a speaker encoder to extract speaker embeddings, together with weight adaptive instance normalization (W-AdaIN) layers. Experiments are conducted with 109 speakers under two low-resource conditions, with 20 and 5 training samples per speaker, respectively. An objective evaluation shows that the proposed model significantly outperforms the baseline methods. Furthermore, a subjective evaluation shows that the proposed model outperforms the baseline method in both naturalness and similarity.

Chairs:

Tomoki Toda

Tags:

signal processing society

IEEE icassp 2021

virtual conference

2021

sps

virtual conference icassp 2021

june 6-11 2021

icassp 2021