Cocktail HuBERT: Generalized Self-Supervised Pre-training for Mixture and Single-Source Speech

Maryam Fazel-Zarandi (Meta); Wei-Ning Hsu (Massachusetts Institute of Technology)

06 Jun 2023

Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to a wider range of acoustic and linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results, with 69% lower word error rate (WER) on multi-speaker ASR and 31% lower diarization error rate (DER) on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
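
The abstract does not spell out how the masked pseudo source separation objective is computed, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes that a mixture is simulated by summing waveforms, that the model emits one stream of per-frame log-probabilities over discovered units for each source, that the loss is evaluated on masked frames only, and that streams are matched to sources by taking the lowest-loss assignment (a permutation-invariant loss). All function names and the toy values are hypothetical.

```python
import itertools
import math


def mix(sources):
    """Simulate a mixture by summing equal-length waveforms sample by sample."""
    return [sum(samples) for samples in zip(*sources)]


def masked_cross_entropy(log_probs, units, masked):
    """Mean negative log-likelihood of the target units over masked frames only."""
    frames = [t for t, is_masked in enumerate(masked) if is_masked]
    if not frames:
        return 0.0
    return -sum(log_probs[t][units[t]] for t in frames) / len(frames)


def permutation_invariant_loss(stream_log_probs, source_units, masked):
    """Lowest mean masked cross-entropy over all stream-to-source assignments.

    Assumption: the abstract does not say how predicted streams are matched
    to sources; trying every permutation is one common way to do it.
    """
    n = len(source_units)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        loss = sum(
            masked_cross_entropy(stream_log_probs[s], source_units[perm[s]], masked)
            for s in range(n)
        ) / n
        best = min(best, loss)
    return best


def confident(units, vocab_size=2, p=0.9):
    """Toy stand-in for model output: puts probability p on each given unit."""
    rest = (1.0 - p) / (vocab_size - 1)
    return [
        [math.log(p if u == k else rest) for k in range(vocab_size)]
        for u in units
    ]


# Hypothetical toy data: two sources, three frames, a two-unit vocabulary.
source_units = [[0, 1, 0], [1, 1, 1]]
masked = [True, False, True]  # only frames 0 and 2 contribute to the loss

# The streams predict the sources in swapped order; the loss should not care.
stream_log_probs = [confident(source_units[1]), confident(source_units[0])]
loss = permutation_invariant_loss(stream_log_probs, source_units, masked)
```

Because the minimum is taken over assignments, the swapped stream order in the toy data is not penalized: the loss depends only on how well the set of predicted streams explains the set of sources at the masked frames.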

Tags:

Machine learning methods for language

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
