MINING HARD SAMPLES LOCALLY AND GLOBALLY FOR IMPROVED SPEECH SEPARATION

Kai Wang, Yizhou Peng, Hao Huang, Ying Hu, Sheng Li

Length: 00:08:05

08 May 2022

Speech separation datasets typically consist of hard and non-hard samples, with the former in the minority and the latter in the majority. This data imbalance biases the model towards non-hard samples and weakens its generalization capability. Given that average separation performance is already sufficiently good, improving performance on hard samples may contribute more to back-end tasks. In this paper, we propose two methods to alleviate data imbalance in the speech separation task, based on local and global hard sample mining. For the local method, we propose a weighted loss that compensates for hard samples by increasing their weights within each batch. For the global method, we perform global hard sample mining and re-sampling to increase the proportion of hard samples in the training set. Because mining hard samples by their objective loss under dynamic mixing yields only local results, we propose an indirect method that uses speaker-specific parameters, based on the fact that the pitch median difference and the x-vector cosine distance of the two speakers in a mixture are closely correlated with separation SI-SNRi. Experimental results show that both methods decrease the percentage of hard samples in the test set compared with using dynamic mixing alone, while keeping the average SI-SNRi comparable, and that the global method shows more promising results than the local one.
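As a rough illustration of the two ideas described in the abstract, the following Python sketch shows one way a per-batch weighted loss and a speaker-pair re-sampling step could look. It is not the authors' implementation. The function names, the `hard_fraction`, `hard_weight`, and `hard_boost` values, the two thresholds, and the rule that a small pitch median difference together with a small x-vector cosine distance marks a pair as hard are all assumptions made for illustration; the abstract only states that these two quantities correlate with SI-SNRi.

```python
import math
import random


def local_weighted_loss(losses, hard_fraction=0.25, hard_weight=2.0):
    """Weighted mean of per-utterance losses within one batch.

    `losses` holds one value per utterance (e.g. negative SI-SNR, so a
    larger value means a harder sample). The hardest `hard_fraction` of
    the batch receives weight `hard_weight`; all others receive 1.0.
    Both parameters are hypothetical.
    """
    n = len(losses)
    n_hard = max(1, int(round(hard_fraction * n)))
    # Indices sorted from hardest (largest loss) to easiest.
    order = sorted(range(n), key=lambda i: losses[i], reverse=True)
    weights = [1.0] * n
    for i in order[:n_hard]:
        weights[i] = hard_weight
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_hard_pair(pitch_a, pitch_b, xvec_a, xvec_b,
                 pitch_thresh_hz=20.0, dist_thresh=0.5):
    """Flag a speaker pair as likely hard to separate.

    Assumption: pairs with similar pitch medians AND close x-vectors are
    hard. The thresholds are placeholders, not values from the paper.
    """
    similar_pitch = abs(pitch_a - pitch_b) < pitch_thresh_hz
    similar_voice = cosine_distance(xvec_a, xvec_b) < dist_thresh
    return similar_pitch and similar_voice


def sample_pairs(pairs, hard_flags, hard_boost=3.0, k=4, seed=0):
    """Draw `k` speaker pairs, sampling hard pairs `hard_boost` times as often."""
    rng = random.Random(seed)
    weights = [hard_boost if hard else 1.0 for hard in hard_flags]
    return rng.choices(pairs, weights=weights, k=k)
```

In such a setup, `local_weighted_loss` would replace the plain batch mean during training, while `is_hard_pair` and `sample_pairs` would decide which speaker pairs are mixed more often when building the training set. Because the pair-level flag depends only on speaker-specific parameters, it does not change from batch to batch, unlike a per-utterance loss under dynamic mixing.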

Tags: data imbalance, hard sample mining, weighted loss, dynamic mixing, speech separation

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
