Target Speaker Extraction with Ultra-Short Reference Speech by VE-VE Framework

Lei Yang (Samsung); Wei Liu (Samsung); Lufen Tan (Samsung); Jaemo Yang (Samsung); Han-gil Moon (Samsung)

07 Jun 2023

The goal of target speaker extraction (TSE) is to extract the target speaker's voice from a speech mixture of multiple speakers. TSE requires enrolling speech from the target speaker in advance as a reference. In practical applications, however, a long enrollment utterance discourages users from using the system. In this paper, we propose a Voice Extractor-Voice Extractor (VE-VE) framework for the TSE task with ultra-short enrollment speech. The enrollment and voice extraction processes use the same RNN-based voice extractor, and speaker characteristics are carried by the RNN state. We design a network, VEVEN, to test the effectiveness of the proposed framework. Experiments show that it achieves new state-of-the-art (SOTA) performance on the public WSJ0-2mix dataset. Furthermore, our approach supports ultra-short reference speech: it achieves 17.7 dB SI-SDRi with a 0.2 s reference.
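The idea of reusing one recurrent extractor for both enrollment and extraction can be illustrated with a minimal sketch. The abstract does not describe the VEVEN architecture, so everything below is an assumption for illustration only: `TinyRNN`, `enroll`, `extract`, the layer sizes, and the random untrained weights are hypothetical and are not the authors' network. The sketch shows only the data flow: the reference frames are run through the RNN, the final state is kept, and that state is used as the initial state when the same RNN processes the mixture.

```python
import math
import random


class TinyRNN:
    """Toy single-layer Elman RNN; a hypothetical stand-in for the voice extractor."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        # Random, untrained weights: this sketch demonstrates data flow, not separation.
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                     for _ in range(hidden_size)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                      for _ in range(hidden_size)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                      for _ in range(input_size)]

    def zero_state(self):
        return [0.0] * self.hidden_size

    def step(self, frame, state):
        # h_t = tanh(W_in x_t + W_rec h_{t-1}); y_t = W_out h_t
        new_state = []
        for i in range(self.hidden_size):
            total = sum(w * x for w, x in zip(self.w_in[i], frame))
            total += sum(w * h for w, h in zip(self.w_rec[i], state))
            new_state.append(math.tanh(total))
        out = [sum(w * h for w, h in zip(row, new_state)) for row in self.w_out]
        return out, new_state

    def run(self, frames, state):
        outputs = []
        for frame in frames:
            out, state = self.step(frame, state)
            outputs.append(out)
        return outputs, state


def enroll(extractor, reference_frames):
    """Run the reference through the extractor and keep only the final RNN state."""
    _, state = extractor.run(reference_frames, extractor.zero_state())
    return state


def extract(extractor, mixture_frames, speaker_state):
    """Run the mixture through the same extractor, starting from the enrolled state."""
    outputs, _ = extractor.run(mixture_frames, speaker_state)
    return outputs


# Hypothetical feature frames (4 values per frame); real systems use audio features.
extractor = TinyRNN(input_size=4, hidden_size=8)
data_rng = random.Random(1)
reference = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # short enrollment
mixture = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(10)]

speaker_state = enroll(extractor, reference)
conditioned = extract(extractor, mixture, speaker_state)
unconditioned = extract(extractor, mixture, extractor.zero_state())
```

In this sketch the only thing passed from enrollment to extraction is `speaker_state`, so comparing `conditioned` with `unconditioned` shows how the enrolled state changes the output for the same mixture. A trained system would learn weights so that this state steers the output toward the target speaker.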

Tags: Speech enhancement and separation

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
