SPEAKERAUGMENT: DATA AUGMENTATION FOR GENERALIZABLE SOURCE SEPARATION VIA SPEAKER PARAMETER MANIPULATION

Kai Wang (Xinjiang University); Yuhang Yang (School of Information Science and Engineering, Xinjiang University, China); Hao Huang (Xinjiang University); Ying Hu (Xinjiang University); Sheng Li (National Institute of Information & Communications Technology (NICT))

07 Jun 2023

Existing deep-learning-based speech separation models typically generalize poorly because of domain mismatch. In this paper, we propose SpeakerAugment (SA), a data augmentation method for generalizable speech separation that increases the diversity of speaker identities in the training data in order to mitigate speaker mismatch, one component of domain mismatch. SA consists of two sub-policies: (1) SA-Vocoder, which uses a vocoder to manipulate the pitch and formant parameters of speakers, and (2) SA-Spectrum, which directly performs pitch shifting and time stretching on the spectrum of each speech signal. SA is simple and effective. Experimental results show that SA can significantly improve the generalization ability of models, especially when (1) the training set contains few speakers, e.g., WSJ0-2mix, or (2) the target test set has complex linguistic conditions, e.g., the TIMIT-based test set. Moreover, as a data augmentation method, SA has good potential to be applied to other speech-related tasks. We validate this by applying SA to speech recognition, where experimental results show that generalization ability also improves.
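
As a rough illustration of the SA-Spectrum idea (pitch shift and time stretch applied directly to a spectrum), the sketch below warps a magnitude spectrogram by linear interpolation along the frequency and time axes. It is not the authors' implementation: the function names (`pitch_shift`, `time_stretch`, `sa_spectrum`), the linear frequency warp, and the sampling ranges (`max_semitones`, `rate_range`) are all assumptions made for this example, and the abstract does not specify how the two operations are parameterized.

```python
import numpy as np

def _resample_rows(x, src):
    """Linearly interpolate each row of x (rows, n) at fractional positions src in [0, n-1]."""
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, x.shape[1] - 1)
    w = src - lo
    return x[:, lo] * (1.0 - w) + x[:, hi] * w

def pitch_shift(spec, semitones):
    """Warp the frequency axis of spec (n_freq, n_frames) by 2**(semitones/12).

    Assumption: a plain linear frequency warp, so output bin k reads input bin k / factor.
    Bins whose source would fall above the original range are set to zero.
    """
    n_freq = spec.shape[0]
    factor = 2.0 ** (semitones / 12.0)
    src = np.arange(n_freq) / factor
    valid = src <= n_freq - 1
    out = _resample_rows(spec.T, np.minimum(src, n_freq - 1)).T
    out[~valid, :] = 0.0
    return out

def time_stretch(spec, rate):
    """Resample the time axis; rate > 1 shortens the signal, rate < 1 lengthens it."""
    n_frames = spec.shape[1]
    n_out = max(1, int(round(n_frames / rate)))
    src = np.linspace(0, n_frames - 1, n_out)
    return _resample_rows(spec, src)

def sa_spectrum(spec, rng, max_semitones=2.0, rate_range=(0.9, 1.1)):
    """Apply a random pitch shift and time stretch (hypothetical ranges) to one source."""
    semitones = rng.uniform(-max_semitones, max_semitones)
    rate = rng.uniform(*rate_range)
    return time_stretch(pitch_shift(spec, semitones), rate)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = np.abs(rng.standard_normal((257, 100)))  # stand-in for a magnitude spectrogram
    print(sa_spectrum(spec, rng).shape)
```

In a separation setting, one would presumably augment each source signal independently and then form the training mixture from the augmented sources, so that the mixture and its targets stay consistent. That ordering is also an assumption rather than something stated in the abstract.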

Tags:

Audio and speech source separation