Voice Conversion Using Feature Specific Loss Function based Self-Attentive Generative Adversarial Network
Sandipan Dhar (National Institute of Technology Durgapur); Padmanabha Banerjee (Jalpaiguri Engineering College); Nanda Dulal Jana (National Institute of Technology Durgapur); Swagatam Das (Indian Statistical Institute)
SPS
Voice conversion (VC) is the process of transforming the vocal texture of a source speaker to resemble that of a target speaker without altering the linguistic content of the source speech. With the ongoing development of deep generative models, generative adversarial networks (GANs) have emerged as a strong alternative to conventional statistical models for VC. However, speech samples generated by existing VC models still differ substantially from natural human speech. Therefore, in this work a GAN-based VC model is proposed whose generator network incorporates a self-attention (SA) mechanism to capture the formant distribution of the target mel-spectrogram efficiently. Moreover, the modulation spectra distance (MSD) is incorporated as a feature-specific loss to achieve high speaker similarity. The proposed model is evaluated on the CMU Arctic and VCC 2018 datasets. Based on objective and subjective evaluations, we observe that the proposed feature-specific loss-based self-attentive GAN (FLSGAN-VC) model performs significantly better than the state-of-the-art (SOTA) MelGAN-VC model.
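The paper does not spell out its exact MSD formulation here, but the idea of a modulation-spectrum loss can be sketched: take the FFT of each mel band's temporal trajectory, then penalize the distance between the log-magnitude modulation spectra of the converted and target mel-spectrograms. The following NumPy sketch illustrates this under common assumptions (log-magnitude spectra compared with mean squared error); the function names and FFT size are hypothetical, not the authors' implementation.

```python
import numpy as np


def modulation_spectrum(mel_spec: np.ndarray, n_fft: int = 64) -> np.ndarray:
    """Magnitude FFT of each mel band's temporal trajectory.

    mel_spec: array of shape (n_mels, n_frames).
    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    return np.abs(np.fft.rfft(mel_spec, n=n_fft, axis=-1))


def msd_loss(mel_converted: np.ndarray, mel_target: np.ndarray,
             n_fft: int = 64, eps: float = 1e-8) -> float:
    """Modulation spectra distance (illustrative): MSE between the
    log modulation spectra of converted and target mel-spectrograms."""
    ms_c = np.log(modulation_spectrum(mel_converted, n_fft) + eps)
    ms_t = np.log(modulation_spectrum(mel_target, n_fft) + eps)
    return float(np.mean((ms_c - ms_t) ** 2))


# Sanity check: identical spectrograms have zero distance,
# perturbed ones have a positive distance.
rng = np.random.default_rng(0)
mel = rng.random((80, 128))        # 80 mel bands, 128 frames
print(msd_loss(mel, mel))          # 0.0
print(msd_loss(mel, mel + 0.1) > 0.0)
```

In training, such a term would typically be added to the adversarial loss with a weighting coefficient, encouraging the generator to match the temporal dynamics of the target speaker's spectral envelope rather than only its frame-wise values.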