A Visual-Pilot Deep Fusion For Target Speech Separation In Multi-Talker Noisy Environment
Yun Li, Zhang Liu, Yueyue Na, Ziteng Wang, Biao Tian, Qiang Fu
Separating the target speech in a multi-talker noisy environment is a challenging problem for audio-only source separation algorithms. The major difficulty is that the separated speech from the same talker can switch among the outputs across consecutive segments, causing the talker permutation issue. In this paper, we deploy face tracking and propose low-dimensional hand-crafted visual features and low-cost deep fusion architectures to separate unseen but visible target sources in a multi-talker noisy environment. We show that our approach not only addresses the talker permutation issue but also yields additional separation improvement on challenging mixtures, such as same-gender overlapping ones, on a public dataset. We also show that significant improvement in target speech recognition is achieved on a simulated real-world dataset. Our training is independent of the number of visible sources, providing flexibility in deployment.
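To make the fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of audio-visual deep fusion for target speech separation. It is not the paper's actual architecture: the class name VisualPilotFusion, the feature dimensions, and the concatenate-then-LSTM fusion are illustrative assumptions. The key point it demonstrates is that conditioning the separation mask on the target talker's visual stream anchors the output to that talker, sidestepping the permutation issue.

```python
import torch
import torch.nn as nn


class VisualPilotFusion(nn.Module):
    """Hypothetical sketch (not the paper's exact model): fuse per-frame
    visual features with audio spectrogram frames to predict a mask for
    the target talker's speech."""

    def __init__(self, n_freq=257, visual_dim=20, hidden=256):
        super().__init__()
        # Project low-dimensional visual features (e.g., lip landmarks
        # obtained from face tracking) and audio frames to a shared size.
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.audio_proj = nn.Linear(n_freq, hidden)
        # Fuse by concatenation, then model temporal context with an LSTM.
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, visual):
        # mix_mag: (batch, frames, n_freq) mixture magnitude spectrogram
        # visual:  (batch, frames, visual_dim) target talker's visual cues
        a = self.audio_proj(mix_mag)
        v = self.visual_proj(visual)
        fused, _ = self.rnn(torch.cat([a, v], dim=-1))
        # The mask applied to the mixture recovers the target talker; the
        # visual stream fixes output identity, so there is no permutation.
        return self.mask(fused) * mix_mag


# Usage: a 1 s batch at 16 kHz with a 10 ms hop gives ~100 frames.
model = VisualPilotFusion()
mix = torch.randn(2, 100, 257).abs()
vis = torch.randn(2, 100, 20)
est = model(mix, vis)  # (2, 100, 257) estimated target magnitude
```

Because the mask is conditioned on one talker's visual features at a time, such a model can be applied once per visible face at inference, which is consistent with training being independent of the number of visible sources.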