Beam-TasNet: Time-Domain Audio Separation Network Meets Frequency-Domain Beamformer
Tsubasa Ochiai, Marc Delcroix, Rintaro Ikeshita, Keisuke Kinoshita, Shoko Araki, Tomohiro Nakatani
Recent studies have shown that acoustic beamforming using a microphone array plays an important role in the construction of high-performance automatic speech recognition (ASR) systems, especially for noisy and overlapping speech conditions. In parallel with the success of multichannel beamforming for ASR, in the speech separation field, the time-domain audio separation network (TasNet), which accepts a time-domain mixture as input and directly estimates the time-domain waveforms for each source, achieves remarkable speech separation performance. In light of these two recent trends, the question naturally arises of whether TasNet can benefit from beamforming to achieve high ASR performance in overlapping speech conditions. Motivated by this question, this paper proposes a novel speech separation scheme, i.e., Beam-TasNet, which combines TasNet with a frequency-domain beamformer, i.e., a minimum variance distortionless response (MVDR) beamformer, through spatial covariance computation to achieve better ASR performance. Experiments on the spatialized WSJ0-2mix corpus show that our proposed Beam-TasNet significantly outperforms the conventional TasNet without beamforming and, moreover, successfully achieves a word error rate comparable to an oracle mask-based MVDR beamformer.
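The abstract describes the overall pipeline only at a high level: TasNet first produces per-source waveform estimates, these estimates are used to compute frequency-domain spatial covariance matrices, and an MVDR beamformer is then applied to the multichannel mixture. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a monaural TasNet is applied channel-wise (the hypothetical `tasnet_separate` callable), approximates the interference signal as the mixture minus the target estimate, and uses a standard reference-channel MVDR formulation.

```python
import numpy as np
from scipy.signal import stft, istft


def spatial_covariance(spec):
    """spec: (channels, freqs, frames) complex STFT -> (freqs, C, C) covariance."""
    C, F, T = spec.shape
    x = spec.transpose(1, 0, 2)                              # (F, C, T)
    return np.einsum("fct,fdt->fcd", x, x.conj()) / T


def mvdr_weights(phi_target, phi_noise, ref_ch=0, eps=1e-8):
    """Reference-channel MVDR: w_f = (Phi_N^-1 Phi_S) e_ref / tr(Phi_N^-1 Phi_S)."""
    F, C, _ = phi_target.shape
    num = np.linalg.solve(phi_noise + eps * np.eye(C), phi_target)  # (F, C, C)
    trace = np.trace(num, axis1=1, axis2=2)[:, None]                # (F, 1)
    return num[:, :, ref_ch] / (trace + eps)                        # (F, C)


def beam_tasnet_separate(mixture, tasnet_separate, fs=16000, nperseg=512):
    """Hypothetical pipeline sketch.

    mixture: (channels, samples) multichannel waveform.
    tasnet_separate: callable returning per-source, per-channel waveform
    estimates of shape (sources, channels, samples).
    """
    est = tasnet_separate(mixture)                       # TasNet pre-separation
    _, _, Y = stft(mixture, fs=fs, nperseg=nperseg)      # mixture STFT: (C, F, T)
    outputs = []
    for s in range(est.shape[0]):
        _, _, S = stft(est[s], fs=fs, nperseg=nperseg)   # target estimate STFT
        # Assumption: interference approximated as mixture minus target estimate.
        _, _, N = stft(mixture - est[s], fs=fs, nperseg=nperseg)
        w = mvdr_weights(spatial_covariance(S), spatial_covariance(N))  # (F, C)
        enhanced = np.einsum("fc,cft->ft", w.conj(), Y)  # beamform the raw mixture
        _, y = istft(enhanced, fs=fs, nperseg=nperseg)
        outputs.append(y)
    return np.stack(outputs)
```

The key design point reflected here is that the beamformer is applied to the unprocessed mixture, so the final output is a linear spatial filter of the observed signals; TasNet's nonlinear estimates are used only to derive the covariance statistics, which is what allows the scheme to retain beamforming's distortionless property for ASR.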