Realistic Real-Time Voice Swapping From Single Unpaired Sentences
Carlo Provinciali, Yihong Liu, Junghoo Kim, Iddo Drori
SPS
We demonstrate a system that allows two speakers to swap their voices using any two unpaired sentences, such that the result is indistinguishable from real voices and runs in real time on a laptop. Each of the two speakers takes turns pronouncing any single short, unpaired sentence into a microphone. Our demo plays the original voice recordings, then swaps the speakers' voices, playing the words pronounced by the first speaker in the second speaker's voice and vice versa. The two input voices are processed in two distinct ways: one path extracts the text of each utterance, and the other learns each speaker's unique voice profile. We extract the text from speaker A's speech using state-of-the-art pre-trained voice-to-text models. We then pass the audio from speaker B through an encoder, which derives an embedding that describes speaker B's distinctive features. Next, we use the text extracted from speaker A and the embedding of speaker B to synthesize a Mel spectrogram, which is fed into a vocoder to generate the final audio of speaker A's sentence in speaker B's voice. The same process is mirrored with speaker A's and B's roles swapped. Our implementation leverages pre-trained neural networks: encoder, synthesizer, and vocoder models, for realistic real-time performance.
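The swap described above can be sketched as a three-stage pipeline. The sketch below is illustrative only: the functions `transcribe`, `embed_speaker`, `synthesize_mel`, and `vocode` are hypothetical stand-ins for the pre-trained voice-to-text, encoder, synthesizer, and vocoder models (here stubbed out with dummy implementations so the control flow is runnable); only the wiring in `swap_voices` reflects the pipeline in the text.

```python
import numpy as np

def transcribe(audio: np.ndarray) -> str:
    # Stand-in for the pre-trained voice-to-text model.
    return "hello world"

def embed_speaker(audio: np.ndarray) -> np.ndarray:
    # Stand-in for the speaker encoder: maps audio to a
    # fixed-size embedding of the speaker's voice profile.
    return np.zeros(256)

def synthesize_mel(text: str, embedding: np.ndarray) -> np.ndarray:
    # Stand-in for the synthesizer: produces a Mel spectrogram
    # conditioned on the text and the speaker embedding.
    return np.zeros((80, len(text)))  # (mel bins, frames)

def vocode(mel: np.ndarray) -> np.ndarray:
    # Stand-in for the vocoder: Mel spectrogram -> waveform
    # (assume a fixed hop of 256 samples per frame).
    return np.zeros(mel.shape[1] * 256)

def swap_voices(audio_a: np.ndarray, audio_b: np.ndarray):
    """Return (A's words in B's voice, B's words in A's voice)."""
    text_a, text_b = transcribe(audio_a), transcribe(audio_b)
    emb_a, emb_b = embed_speaker(audio_a), embed_speaker(audio_b)
    out_ab = vocode(synthesize_mel(text_a, emb_b))  # A's sentence, B's voice
    out_ba = vocode(synthesize_mel(text_b, emb_a))  # B's sentence, A's voice
    return out_ab, out_ba
```

In a real implementation, each stub would be replaced by a pre-trained network, and the mirrored call structure in `swap_voices` is what lets a single forward pass per direction keep the demo interactive.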