On the effectiveness of monoaural target source extraction for distant end-to-end automatic speech recognition
Catalin Zorila (Toshiba Cambridge Research Laboratory); Rama S Doddipatla (Toshiba Europe LTD)
SPS
Recent work on enhancement has shown that frequency domain methods may outperform time domain approaches. However, most of the prior art either reports objective enhancement metrics on simulated noisy data or uses less modern hybrid acoustic models for evaluation.
In this paper we investigate the effectiveness of target source extraction for improving the robustness of end-to-end automatic speech recognition in noisy and reverberant conditions.
A frequency domain source extraction approach is introduced and compared against a state-of-the-art time domain method using several publicly available simulated and real noisy speech test sets.
The results show that the frequency domain method outperforms the time domain one only in simulated conditions, and that it is more robust to variations in window size.
The experiments also indicate that remixing the unprocessed signal with the enhanced speech (referred to as speaker/source reinforcement) yields similar or better results than using a matched acoustic model retrained on the distortions introduced by enhancement.
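
The remixing step can be illustrated with a minimal sketch. The abstract does not specify the mixing rule, so the additive form, the function name `reinforce`, and the `mix_db` parameter (the level, in dB, at which the unprocessed signal is added back) are assumptions made here for illustration only, not the authors' implementation.

```python
import numpy as np


def reinforce(noisy, enhanced, mix_db=-10.0):
    """Add an attenuated copy of the unprocessed signal to the enhanced one.

    `mix_db` is a hypothetical parameter: the gain, in dB, applied to the
    unprocessed (noisy) signal before it is added to the enhanced speech.
    """
    noisy = np.asarray(noisy, dtype=np.float64)
    enhanced = np.asarray(enhanced, dtype=np.float64)
    if noisy.shape != enhanced.shape:
        raise ValueError("noisy and enhanced must have the same shape")
    # Convert the dB level to a linear amplitude gain.
    gain = 10.0 ** (mix_db / 20.0)
    # Assumed additive remix: enhanced speech plus scaled unprocessed input.
    return enhanced + gain * noisy
```

Under this assumed rule, a very low `mix_db` leaves the enhanced signal essentially unchanged, while values closer to 0 dB restore more of the original input, trading residual noise against the distortions introduced by enhancement.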