Target Speaker Extraction with Ultra-Short Reference Speech by VE-VE Framework

Lei Yang (Samsung); Wei Liu (Samsung); Lufen Tan (Samsung); Jaemo Yang (Samsung); Han-gil Moon (Samsung)

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
07 Jun 2023

The goal of target speaker extraction (TSE) is to extract the target speaker's voice from a speech mixture of multiple speakers. TSE requires the target speaker's speech to be enrolled in advance as a reference. In practical applications, however, requiring a long reference utterance during enrollment discourages users from using the feature. In this paper, we propose a Voice Extractor-Voice Extractor (VE-VE) framework for the TSE task with ultra-short enrollment speech. The enrollment and voice extraction processes use the same RNN-based voice extractor, and speaker characteristics are carried by the RNN state. We design a network, VEVEN, to test the effectiveness of the proposed framework. Experiments show that it achieves new state-of-the-art (SOTA) performance on the public WSJ0-2mix dataset. Furthermore, our approach supports ultra-short reference speech: it achieves 17.7 dB SI-SDRi with only 0.2 s of reference speech.
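For readers unfamiliar with the SI-SDRi figure quoted above, the following is a minimal sketch of how scale-invariant signal-to-distortion ratio improvement is commonly computed. It uses the standard definition of the metric, not code from the paper; the function names `si_sdr` and `si_sdr_improvement`, the `eps` guard, and the plain-list signal representation are assumptions made for illustration.

```python
import math

def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative helper (not from the paper); eps guards against division by zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to get the optimal scaling factor.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    signal_energy = sum(s * s for s in scaled_target)
    # Everything in the estimate not explained by the scaled target is distortion.
    noise_energy = sum((a - s) ** 2 for a, s in zip(e, scaled_target))
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))

def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the extracted signal over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

The improvement is measured relative to the unprocessed mixture, so a larger SI-SDRi means the extractor removed more of the interfering speech.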
