Multi-Channel Speaker Extraction with Adversarial Training: the Wavlab submission to the Clarity ICASSP 2023 Grand Challenge
Samuele Cornell (Università Politecnica delle Marche); Zhong-Qiu Wang (Carnegie Mellon University); Yoshiki Masuyama (Tokyo Metropolitan University); Shinji Watanabe (Carnegie Mellon University); Manuel Pariente (Pulse Audition); Nobutaka Ono (Tokyo Metropolitan University); Stefano Squartini (Università Politecnica delle Marche)
In this work we detail our submission to the Clarity ICASSP 2023 Grand Challenge, in which participants have to develop a strong target-speech enhancement system for hearing-aid (HA) devices in noisy, reverberant environments. Our system builds on iNeuBe-X, our submission to the Second Clarity Enhancement Challenge (CEC2), which consists of an iterative neural/conventional beamforming enhancement pipeline guided by an enrollment utterance from the target speaker. This model, which won the CEC2 by a large margin, is an extension of the state-of-the-art TF-GridNet model to multi-channel, streamable target-speaker speech enhancement. Here, this approach is extended and further improved by leveraging generative adversarial training, which we show to be especially useful when the training data is limited. Using only the official 6k training scenes, our best model achieves a hearing-aid speech perception index (HASPI) score of 0.80 and a hearing-aid speech quality index (HASQI) score of 0.41 on the synthetic evaluation set. However, our model generalized poorly to the semi-real evaluation set. This highlights the fact that our community should focus more on real-world evaluation and less on fully synthetic datasets.
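The exact training recipe is described in the paper; purely as an illustration of the generative adversarial training idea mentioned above, the following PyTorch sketch shows a typical setup in which the enhancement network acts as the generator and a waveform discriminator provides an adversarial loss on top of a signal-level loss. The `Enhancer` and `Discriminator` modules, the LSGAN-style losses, and the loss weighting are hypothetical placeholders, not the actual iNeuBe-X/TF-GridNet architecture or the challenge configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Enhancer(nn.Module):
    """Placeholder enhancement network: multi-channel mixture -> single-channel estimate."""
    def __init__(self, n_mics=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mics, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=5, padding=2),
        )

    def forward(self, mix):          # mix: (batch, n_mics, time)
        return self.net(mix)         # est: (batch, 1, time)


class Discriminator(nn.Module):
    """Placeholder waveform discriminator: speech -> realness score."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=15, stride=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=15, stride=4),
        )

    def forward(self, wav):          # wav: (batch, 1, time)
        return self.net(wav).mean(dim=(1, 2))   # one score per example


enhancer, disc = Enhancer(), Discriminator()
opt_g = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)


def train_step(mix, target, adv_weight=0.1):
    # Discriminator update: clean target speech is "real", enhanced speech is "fake".
    with torch.no_grad():
        est = enhancer(mix)
    d_loss = F.mse_loss(disc(target), torch.ones(target.size(0))) + \
             F.mse_loss(disc(est), torch.zeros(est.size(0)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: signal-level loss plus a weighted adversarial term.
    est = enhancer(mix)
    g_loss = F.l1_loss(est, target) + \
             adv_weight * F.mse_loss(disc(est), torch.ones(est.size(0)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()


# Random tensors standing in for one Clarity training scene (1 s at 16 kHz).
mix = torch.randn(2, 6, 16000)      # hypothetical 6-channel HA mixture
target = torch.randn(2, 1, 16000)   # corresponding target speech
print(train_step(mix, target))
```

In this kind of setup, the adversarial term pushes the enhanced waveforms toward the distribution of clean speech, which can act as a regularizer when, as in this challenge, only a few thousand training scenes are available.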