07 Jun 2023

Powered by recent advances in AI-based representation and generation, text-to-speech technology has reached unprecedented levels of quality and flexibility. Self-supervised learning techniques have provided ways to formulate efficient latent spaces that offer more control over different qualities of the generated speech, zero-shot training has allowed matching the characteristics of unseen speakers, and efficient prior networks have contributed to disentangling content, speaker, emotion and other dimensions of speech. These developments have boosted existing application areas and have also made it possible to tackle new ones that previously seemed much more distant. We’ll discuss some of the recent advances in specific areas of the field, including our team’s work on multi-speaker, multi-/cross-lingual, expressive and controllable TTS, on synthesized singing, and on automatic evaluation of synthetic speech. We’ll also look into cloning existing speakers as well as generating novel ones. Finally, we’ll touch on the valid concerns that such unprecedented technical capabilities raise. Voice is a key element of one’s identity, and although such technologies hold great promise for useful applications, they also have potential for abuse. This raises ethical and intellectual property questions, both in the context of the creative industries and in our everyday lives.