08 Jun 2021

VQ-VAE-based voice conversion models have recently received increasing attention in non-parallel many-to-many voice conversion, where the encoder extracts speaker-invariant linguistic content from the input speech using vector quantization and the decoder produces the target speech from the encoder output, conditioned on the target speaker representation. However, it is challenging for the encoder to strike a proper balance between removing speaker information and preserving linguistic content, which degrades the converted speech quality. To address this issue, we propose the Local Linguistic Tokens (LLTs) model, which learns high-quality speaker-invariant linguistic embeddings with a multi-head attention module of the kind that has proven effective for extracting speaking-style embeddings in Global Style Tokens (GSTs). By replacing vector quantization, the multi-head attention module allows the encoder to preserve more linguistic content and thereby enhance the converted speech quality. Both objective and subjective experimental results show that, compared with the state-of-the-art VQ-VAE model, the proposed LLTs model achieves significantly better speech quality and comparable speaker similarity. Converted samples are available online for listening.
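To make the core idea concrete, below is a minimal PyTorch sketch of the token-attention bottleneck the abstract describes: each encoder frame attends over a bank of learned linguistic tokens via multi-head attention, in place of a hard VQ codebook lookup. The module name, token count, dimensions, and head count are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn


class LocalLinguisticTokens(nn.Module):
    """Sketch of a token-attention bottleneck: each encoder frame (query)
    attends over a learned token bank (keys/values), and the attention
    output serves as that frame's speaker-invariant linguistic embedding."""

    def __init__(self, d_model=256, num_tokens=64, num_heads=8):
        super().__init__()
        # Learnable token bank shared across all frames and utterances.
        self.tokens = nn.Parameter(torch.randn(num_tokens, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, enc_out):
        # enc_out: (batch, frames, d_model) encoder features used as queries.
        batch = enc_out.size(0)
        kv = self.tokens.unsqueeze(0).expand(batch, -1, -1)
        # Soft attention over the token bank instead of a hard VQ assignment,
        # so more linguistic detail can pass through the bottleneck.
        linguistic, _ = self.attn(query=enc_out, key=kv, value=kv)
        return linguistic


# Usage: drop-in replacement for the quantizer between encoder and decoder.
llt = LocalLinguisticTokens()
frames = torch.randn(2, 120, 256)   # dummy encoder output
linguistic_emb = llt(frames)        # (2, 120, 256)
```

The design choice mirrors GST, except the attention is applied per frame (locally) rather than once per utterance, which is why the output remains a frame-level sequence the decoder can condition on alongside the target speaker representation.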

Chairs:
Tomoki Toda
