Streaming Multi-channel Speech Separation with Online Time-domain Generalized Wiener Filter
Yi Luo (Tencent AI Lab)
SPS
Most existing streaming neural-network-based multi-channel speech separation systems consist of a causal network architecture and an online spatial information extraction module. The spatial information extraction module can either be a feature calculation module that generates cross-channel features or an online beamforming module that explicitly performs frame- or chunk-level spatial filtering. While such online beamforming modules were mainly proposed in the frequency domain, recent literature has investigated the potential of learnable time-domain methods, which can be jointly optimized with the entire model under a single training objective. Among these methods, the time-domain generalized Wiener filter (TD-GWF) has shown performance gains over conventional frequency-domain beamformers in the sequential beamforming pipeline. In this paper, we modify the offline TD-GWF into an online counterpart via a Sherman–Morrison formula-based approximation and describe how we simplify and stabilize the training phase. Experimental results on applying various offline and online spatial filtering modules in the sequential beamforming pipeline show that the online TD-GWF can outperform an offline frequency-domain multi-channel Wiener filter (FD-MCWF) on the noisy multi-channel reverberant speech separation task.
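The online approximation rests on the Sherman–Morrison identity, which updates the inverse of a covariance matrix after a rank-one modification without re-inverting it: given R⁻¹, the inverse of R + xxᴴ is R⁻¹ − (R⁻¹x xᴴR⁻¹)/(1 + xᴴR⁻¹x). Below is a minimal sketch of this recursive-inverse idea in isolation; the function name and the frame-by-frame loop are illustrative assumptions, not the paper's TD-GWF implementation.

```python
import numpy as np

def sherman_morrison_update(R_inv, x):
    """Return (R + x x^H)^{-1} given R^{-1} and a new frame vector x."""
    Rx = R_inv @ x
    denom = 1.0 + np.real(np.conj(x) @ Rx)  # scalar; real for Hermitian R
    return R_inv - np.outer(Rx, np.conj(Rx)) / denom

rng = np.random.default_rng(0)
n = 4
R = np.eye(n)                 # diagonal loading as the initial covariance
R_inv = np.linalg.inv(R)
for _ in range(10):           # stream of frames: R accumulates x x^H terms
    x = rng.standard_normal(n)
    R = R + np.outer(x, x)
    R_inv = sherman_morrison_update(R_inv, x)

# The recursively updated inverse matches the direct inverse.
print(np.allclose(R_inv, np.linalg.inv(R)))  # True
```

Each update costs O(n²) instead of the O(n³) of a fresh inversion, which is what makes frame-level (streaming) spatial filtering affordable.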