MULTI-CHANNEL MULTI-SPEAKER ASR USING 3D SPATIAL FEATURE

Yiwen Shao, Shi-Xiong Zhang, Dong Yu

Length: 00:11:54

08 May 2022

Automatic speech recognition (ASR) of multi-channel, multi-speaker overlapped speech remains one of the most challenging tasks for the speech community. In this paper, we address this challenge by using, for the first time, the location of the target speakers in 3D space. To explore the strength of the proposed 3D spatial feature, two paradigms are investigated: 1) a pipelined system with a multi-channel speech separation module followed by a state-of-the-art single-channel ASR module; 2) an "All-In-One" model in which the 3D spatial feature is fed directly to the ASR system without an explicit separation module. Both are fully differentiable and can be trained end-to-end with back-propagation. We test them on simulated overlapped speech and real recordings. Experimental results show that 1) the proposed All-In-One model achieves an error rate comparable to that of the pipelined system while halving the inference time; 2) the proposed 3D spatial feature significantly outperforms (31% CERR) all previous work using 1D directional information in both paradigms.

Tags:

speech separation

multi-speaker asr

3d feature

multi-channel asr

audio-visual asr

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
