MixCycle: Unsupervised Speech Separation via Cyclic Mixture Permutation Invariant Training

Serap Kırbız (MEF Üniversitesi); Ertuğ Karamatlı (Boğaziçi University)

09 Jun 2023

We introduce two unsupervised source separation methods that rely on self-supervised training from single-channel two-source speech mixtures. Our first method, mixture permutation invariant training (MixPIT), trains a neural network model to separate the underlying sources via a challenging proxy task, without supervision from the reference sources. Our second method, cyclic mixture permutation invariant training (MixCycle), uses MixPIT as a building block in a cyclic fashion for continuous learning. MixCycle gradually converts the problem from separating mixtures of mixtures into separating single mixtures. We compare our methods to common supervised and unsupervised baselines: permutation invariant training with dynamic mixing (PIT-DM) and mixture invariant training (MixIT), respectively. We show that MixCycle outperforms MixIT and performs very close to the supervised baseline (PIT-DM) while circumventing the over-separation issue of MixIT. We also propose a self-evaluation technique, inspired by MixCycle, that estimates model performance without using any reference sources. We show that it yields results consistent with an evaluation on reference sources (LibriMix) and with an informal listening test conducted on a dataset of real-life mixtures (REAL-M).
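The abstract does not spell out the MixPIT proxy task, so the following is only a minimal sketch of one plausible reading: two mixtures are summed into a mixture of mixtures, a two-output model is asked to recover the two original mixtures, and a permutation invariant loss picks the best assignment of outputs to targets. The function names (`neg_snr`, `pit_loss`, `mixpit_loss`), the negative-SNR objective, and the `model` callable are all assumptions made for illustration and are not taken from the paper.

```python
import itertools
import numpy as np


def neg_snr(estimate, target, eps=1e-8):
    """Negative signal-to-noise ratio in dB (assumed objective; lower is better)."""
    noise = target - estimate
    ratio = (np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps)
    return -10.0 * np.log10(ratio)


def pit_loss(estimates, targets):
    """Permutation invariant loss: minimum mean loss over all output-to-target assignments."""
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(len(targets))):
        # perm[i] is the index of the estimate assigned to target i
        loss = np.mean([neg_snr(estimates[p], targets[i]) for i, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def mixpit_loss(model, mixture_a, mixture_b):
    """Hypothetical MixPIT-style step: the two mixtures themselves act as training targets."""
    mixture_of_mixtures = mixture_a + mixture_b
    estimates = model(mixture_of_mixtures)  # assumed to return two waveforms
    return pit_loss(estimates, [mixture_a, mixture_b])
```

Here `model` stands in for any separator that maps one waveform to two. No reference sources appear anywhere in the loss, which is what makes the training signal unsupervised in this sketch.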

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
