
Realistic Real-Time Voice Swapping From Single Unpaired Sentences

Carlo Provinciali, Yihong Liu, Junghoo Kim, Iddo Drori

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 09:11
04 May 2020

We demonstrate a system that lets two speakers swap their voices using any two unpaired sentences, such that the result is indistinguishable from real voices and the swap runs in real time on a laptop. The two speakers take turns speaking a single short sentence into a microphone; the sentences need not be paired. Our demo plays the original voice recordings and then swaps the speakers' voices, playing the words spoken by the first speaker in the second speaker's voice and vice versa. The two input voices are processed in two distinct ways: one to extract the text of each utterance, and one to learn each speaker's unique voice profile. We extract the text from speaker A's speech using state-of-the-art pre-trained speech-to-text models. We then pass the audio from speaker B through an encoder, which derives an embedding that describes speaker B's distinctive features. Next, we use the text extracted from speaker A and the embedding of speaker B to synthesize a Mel spectrogram, which is fed into a vocoder to generate the final audio of speaker A's sentence in speaker B's voice. The same process is mirrored with the roles of speakers A and B swapped. Our implementation leverages pre-trained neural networks (encoder, synthesizer, and vocoder models) to achieve realistic real-time performance.
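The data flow described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the class name `VoiceSwapPipeline`, its four fields, and the placeholder types `Audio`, `Embedding`, and `Mel` are all hypothetical, and each field stands in for a pre-trained model that the abstract mentions but does not name.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Placeholder types (assumptions for illustration only).
Audio = List[float]        # mono waveform samples
Embedding = List[float]    # fixed-length speaker embedding
Mel = List[List[float]]    # Mel spectrogram, one list per frame


@dataclass
class VoiceSwapPipeline:
    # Hypothetical callables standing in for pre-trained models;
    # none of these names come from the original abstract.
    speech_to_text: Callable[[Audio], str]
    speaker_encoder: Callable[[Audio], Embedding]
    synthesizer: Callable[[str, Embedding], Mel]
    vocoder: Callable[[Mel], Audio]

    def convert(self, content_audio: Audio, voice_audio: Audio) -> Audio:
        """Speak the words of content_audio in the voice of voice_audio."""
        text = self.speech_to_text(content_audio)      # what was said
        embedding = self.speaker_encoder(voice_audio)  # who should say it
        mel = self.synthesizer(text, embedding)        # text + voice -> Mel
        return self.vocoder(mel)                       # Mel -> waveform

    def swap(self, audio_a: Audio, audio_b: Audio) -> Tuple[Audio, Audio]:
        """Return (A's words in B's voice, B's words in A's voice)."""
        return self.convert(audio_a, audio_b), self.convert(audio_b, audio_a)
```

In this sketch, the text path and the voice-profile path only meet at the synthesizer, which is why the two input sentences do not need to be paired: `swap` simply calls `convert` twice with the arguments reversed.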
