Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio
Yang Zhang (NVIDIA); Krishna C Puvvada (NVIDIA); Vitaly Lavrukhin (NVIDIA); Boris Ginsburg (NVIDIA)
We propose CONF-TSASR, a non-autoregressive, end-to-end, time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). The model consists of a TitaNet-based speaker embedding module and Conformer-based masking and ASR modules. These modules are trained with Connectionist Temporal Classification (CTC) loss and are jointly optimized to transcribe the target speaker while ignoring speech from other speakers. Optionally, we include a novel scale-invariant spectrogram reconstruction loss to encourage the model to better separate the target speaker's spectrogram from the mixture. We obtain state-of-the-art target-speaker word error rate (TS-WER) on WSJ0-2mix-extr (4.2%). Further, we report, for the first time, TS-WER on WSJ0-3mix-extr (12.4%), LibriSpeech2Mix (4.6%), and LibriSpeech3Mix (7.6%), establishing new benchmarks for TS-ASR. The proposed model will be open-sourced through the NVIDIA NeMo toolkit.
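The abstract mentions a scale-invariant spectrogram reconstruction loss as an optional auxiliary objective. The paper's exact formulation is not given here; the following is a minimal sketch assuming an SI-SDR-style scale-invariant loss applied to magnitude spectrograms, with tensor shapes and the function name chosen for illustration only.

```python
# Hypothetical sketch of a scale-invariant spectrogram reconstruction loss.
# This assumes an SI-SDR-style formulation applied to magnitude spectrograms;
# names, shapes, and the exact definition are assumptions, not the paper's code.
import torch


def si_spectrogram_loss(est_spec: torch.Tensor, ref_spec: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR between estimated and reference spectrograms.

    est_spec, ref_spec: (batch, freq, time) magnitude spectrograms.
    """
    # Flatten the time-frequency plane so each example is a single vector.
    est = est_spec.flatten(start_dim=1)
    ref = ref_spec.flatten(start_dim=1)

    # Optimal scaling factor projecting the estimate onto the reference,
    # which makes the measure invariant to the estimate's overall scale.
    alpha = (est * ref).sum(dim=1, keepdim=True) / (ref.pow(2).sum(dim=1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target

    # Scale-invariant SNR in dB; negated so that minimizing the loss
    # maximizes reconstruction quality.
    si_snr = 10 * torch.log10(
        target.pow(2).sum(dim=1) / (noise.pow(2).sum(dim=1) + eps) + eps
    )
    return -si_snr.mean()
```

In a joint training setup such a term would typically be added to the CTC loss with a weighting coefficient, so that the masking module is encouraged to recover the target speaker's spectrogram while the ASR module is optimized for transcription.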