
Adapting self-supervised models to multi-talker speech recognition using speaker embeddings

Zili Huang (Johns Hopkins University); Desh Raj (Johns Hopkins University); Paola Garcia (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University)

07 Jun 2023

Self-supervised learning (SSL) methods, which learn representations of data without explicit supervision, have gained popularity in speech-processing tasks, particularly for single-talker applications. However, these models often degrade in multi-talker scenarios, possibly due to domain mismatch, which severely limits their use for such applications. In this paper, we investigate the adaptation of upstream SSL models to the multi-talker automatic speech recognition (ASR) task under two conditions. First, when segmented utterances are given, we show that adding a target speaker extraction (TSE) module based on enrollment embeddings is complementary to mixture-aware pre-training. Second, for unsegmented mixtures, we propose a novel joint speaker modeling (JSM) approach, which aggregates information from all speakers in the mixture through their embeddings. With controlled experiments on Libri2Mix, we show that using speaker embeddings provides relative WER improvements of 9.1% and 42.1% over strong baselines for the segmented and unsegmented cases, respectively. We also demonstrate the effectiveness of our models for real conversational mixtures through experiments on the AMI dataset. Our code and models are open-sourced at https://github.com/HuangZiliAndy/SSL_for_multitalker.
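
The exact TSE and JSM architectures are described in the paper and the linked repository. As a rough illustration only, conditioning upstream SSL frame features on an enrollment speaker embedding (the core idea behind the TSE module) can be sketched as a learned, per-channel modulation; all class names, dimensions, and the FiLM-style fusion below are illustrative assumptions, not the authors' exact design.

```python
# A minimal sketch (not the authors' exact architecture) of conditioning
# SSL frame-level features on an enrollment speaker embedding, in the spirit
# of target speaker extraction (TSE). Dimensions are illustrative.
import torch
import torch.nn as nn


class SpeakerConditioning(nn.Module):
    """Fuse a fixed-dimensional speaker embedding into frame-level SSL features."""

    def __init__(self, feat_dim: int = 768, spk_dim: int = 256):
        super().__init__()
        # Project the enrollment embedding to a per-channel scale and bias.
        self.scale = nn.Linear(spk_dim, feat_dim)
        self.bias = nn.Linear(spk_dim, feat_dim)

    def forward(self, ssl_feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # ssl_feats: (batch, frames, feat_dim), e.g. outputs of an upstream SSL model
        # spk_emb:   (batch, spk_dim), e.g. an embedding of the target speaker
        s = self.scale(spk_emb).unsqueeze(1)  # (batch, 1, feat_dim)
        b = self.bias(spk_emb).unsqueeze(1)   # (batch, 1, feat_dim)
        return ssl_feats * s + b              # speaker-conditioned features for the ASR head


if __name__ == "__main__":
    layer = SpeakerConditioning()
    feats = torch.randn(2, 100, 768)   # two mixtures, 100 frames each
    emb = torch.randn(2, 256)          # enrollment embeddings of the target speakers
    print(layer(feats, emb).shape)     # torch.Size([2, 100, 768])
```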
