This paper describes an end-to-end voice conversion system built on three main ideas: the Transformer architecture, context preservation mechanisms, and model adaptation. Self-attention in the Transformer directly connects all positions in a sequence, making it easier to learn long-range dependencies and improving training efficiency. Context preservation mechanisms accelerate and stabilize training. Adaptation techniques are conducive to training the conversion mapping with limited data. The results show that the proposed method obtains a higher MOS than an LSTM-based baseline system while training 2.72 times faster.
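As a minimal illustration of why self-attention connects all positions directly (this is a generic sketch, not the paper's implementation; all names and dimensions are illustrative), a single scaled dot-product self-attention layer can be written in a few lines of NumPy. The score matrix is (seq_len, seq_len), so every output frame attends to every input frame in one step, with no recurrence over time as in an LSTM.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence.

    x: (seq_len, d_model) input frames. Each output position is a
    weighted mix of ALL input positions, which is what lets the
    Transformer model long-range dependencies directly.
    """
    q = x @ w_q  # queries, (seq_len, d_k)
    k = x @ w_k  # keys,    (seq_len, d_k)
    v = x @ w_v  # values,  (seq_len, d_k)
    d_k = q.shape[-1]
    # Pairwise position-to-position scores: (seq_len, seq_len)
    scores = q @ k.T / np.sqrt(d_k)
    # Softmax over input positions (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (seq_len, d_k)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)
```

Because all positions are processed in parallel rather than sequentially, this layer also parallelizes better during training than a recurrent layer, which is consistent with the reported speedup over the LSTM baseline.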