On the effectiveness of monoaural target source extraction for distant end-to-end automatic speech recognition

Catalin Zorila (Toshiba Cambridge Research Laboratory); Rama S Doddipatla (Toshiba Europe LTD)

07 Jun 2023

Recent work on speech enhancement has shown that frequency domain methods may outperform time domain approaches. However, most of the prior art either reports objective enhancement metrics on simulated noisy data or uses older hybrid acoustic models for evaluation. In this paper we investigate the effectiveness of target source extraction for improving the robustness of end-to-end automatic speech recognition in noisy and reverberant conditions. A frequency domain source extraction approach is introduced and compared against a state-of-the-art time domain method using several publicly available simulated and real noisy speech test sets. The results show that the frequency domain method outperforms the time domain one only for simulated conditions, and that it is more robust to window size variations. The experiments also indicate that remixing the unprocessed signal with the enhanced speech (referred to as speaker/source reinforcement) yields similar or better results than using a matched acoustic model retrained on the distortions introduced by enhancement.
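
The speaker/source reinforcement step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the mixing rule, so this assumes a simple sample-wise addition of the unprocessed mixture, attenuated by a fixed number of decibels, to the enhanced signal. The function name `reinforce` and the parameter `attenuation_db` are hypothetical and are not taken from the paper.

```python
def reinforce(enhanced, mixture, attenuation_db=10.0):
    """Remix the unprocessed mixture into the enhanced signal.

    Assumption (not from the paper): the mixture is attenuated by
    `attenuation_db` decibels and added sample by sample.
    """
    if len(enhanced) != len(mixture):
        raise ValueError("signals must have the same length")
    gain = 10.0 ** (-attenuation_db / 20.0)  # dB to linear amplitude
    return [e + gain * m for e, m in zip(enhanced, mixture)]


# Toy data: a 4-sample "enhanced" signal and its unprocessed mixture.
enhanced = [0.0, 0.5, -0.5, 0.25]
mixture = [0.1, 0.6, -0.4, 0.2]
remixed = reinforce(enhanced, mixture, attenuation_db=20.0)
```

In this sketch a larger `attenuation_db` keeps the output closer to the enhanced signal, while a smaller value restores more of the original mixture, which may mask processing artefacts at the cost of reintroducing noise.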

Tags:

Robust speech recognition and adaptation

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
