Multi-Channel Speaker Extraction with Adversarial Training: the Wavlab submission to the Clarity ICASSP 2023 Grand Challenge
Samuele Cornell (Università Politecnica delle Marche); Zhong-Qiu Wang (Carnegie Mellon University); Yoshiki Masuyama (Tokyo Metropolitan University); Shinji Watanabe (Carnegie Mellon University); Manuel Pariente (Pulse Audition); Nobutaka Ono (Tokyo Metropolitan University); Stefano Squartini (Università Politecnica delle Marche)
In this work we detail our submission to the Clarity ICASSP 2023 Grand Challenge, in which participants have to develop a strong target-speech enhancement system for hearing-aid (HA) devices in noisy, reverberant environments. Our system builds on iNeuBe-X, our submission to the Second Clarity Enhancement Challenge (CEC2), which consists of an iterative neural/conventional beamforming enhancement pipeline guided by an enrollment utterance from the target speaker. This model, which won the CEC2 by a large margin, is an extension of the state-of-the-art TF-GridNet model to multi-channel, streamable target-speaker speech enhancement. Here, this approach is extended and further improved by leveraging generative adversarial training, which we show to be especially useful when the training data is limited. Using only the official 6k training scenes, our best model achieves a hearing-aid speech perception index (HASPI) score of 0.80 and a hearing-aid speech quality index (HASQI) score of 0.41 on the synthetic evaluation set. However, our model generalized poorly to the semi-real evaluation set. This highlights the fact that our community should focus more on real-world evaluation and less on fully synthetic datasets.
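The exact training recipe is described in the paper; purely as an illustration of the generative adversarial training idea mentioned above, the following PyTorch sketch shows a typical setup in which the enhancement network acts as the generator and a waveform discriminator provides an adversarial loss on top of a signal-level loss. The `Enhancer` and `Discriminator` modules, the LSGAN-style losses, and the loss weighting are hypothetical placeholders, not the actual iNeuBe-X/TF-GridNet architecture or the challenge configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Enhancer(nn.Module):
    """Placeholder enhancement network: multi-channel mixture -> single-channel estimate."""
    def __init__(self, n_mics=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mics, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=5, padding=2),
        )

    def forward(self, mix):          # mix: (batch, n_mics, time)
        return self.net(mix)         # est: (batch, 1, time)


class Discriminator(nn.Module):
    """Placeholder waveform discriminator: speech -> realness score."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=15, stride=4),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=15, stride=4),
        )

    def forward(self, wav):          # wav: (batch, 1, time)
        return self.net(wav).mean(dim=(1, 2))   # one score per example


enhancer, disc = Enhancer(), Discriminator()
opt_g = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)


def train_step(mix, target, adv_weight=0.1):
    # Discriminator update: clean target speech is "real", enhanced speech is "fake".
    with torch.no_grad():
        est = enhancer(mix)
    d_loss = F.mse_loss(disc(target), torch.ones(target.size(0))) + \
             F.mse_loss(disc(est), torch.zeros(est.size(0)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: signal-level loss plus a weighted adversarial term.
    est = enhancer(mix)
    g_loss = F.l1_loss(est, target) + \
             adv_weight * F.mse_loss(disc(est), torch.ones(est.size(0)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()


# Random tensors standing in for one Clarity training scene (1 s at 16 kHz).
mix = torch.randn(2, 6, 16000)      # hypothetical 6-channel HA mixture
target = torch.randn(2, 1, 16000)   # corresponding target speech
print(train_step(mix, target))
```

In this kind of setup, the adversarial term pushes the enhanced waveforms toward the distribution of clean speech, which can act as a regularizer when, as in this challenge, only a few thousand training scenes are available.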