Robust self-supervised speaker representation learning via instance mix regularization
Woohyun Kang, Jahangir Alam, Abderrahim Fathan
In recent years, various self-supervised contrastive embedding learning methods for deep speaker verification have been proposed. The performance of the self-supervised contrastive learning framework depends heavily on the data augmentation technique, but because of the sensitive nature of speaker information within the speech signal, most speaker embedding training relies on simple augmentations such as additive noise or simulated reverberation. Thus, while conventional self-supervised speaker embedding systems can achieve minimal within-utterance variability, their ability to generalize to out-of-set utterances is limited. To alleviate this problem, we propose a novel self-supervised learning framework for speaker verification that combines the angular prototypical loss with instance mix (i-mix) regularization. The proposed method was evaluated on the VoxCeleb1 dataset and showed a noticeable improvement over the standard self-supervised embedding method.
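
To make the combination concrete, the PyTorch sketch below shows one way angular prototypical training and i-mix regularization could be composed; it is a minimal illustration under stated assumptions, not the authors' implementation. In particular, the `encoder`, the choice to mix at the input-segment level, the function name, and the parameters `w`, `b`, and `alpha` are all assumptions introduced here for illustration. Each utterance in the batch is treated as its own pseudo-class, a query segment is mixed with a randomly paired segment using a Beta-sampled coefficient, and the loss inherits both virtual labels in proportion to the mixing weight.

```python
import torch
import torch.nn.functional as F

def imix_angular_prototypical_loss(encoder, seg_a, seg_b, w, b, alpha=1.0):
    """Hypothetical sketch: angular prototypical loss with i-mix.

    seg_a, seg_b: two differently augmented segments of the same N
    utterances (shape (N, ...)); each utterance is its own pseudo-class.
    encoder: maps a batch of segments to (N, D) embeddings (assumed).
    w, b: learnable scale and bias of the angular prototypical loss.
    alpha: Beta distribution parameter controlling the mixing strength.
    """
    N = seg_a.size(0)

    # i-mix: blend each query segment with a randomly paired instance.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(N, device=seg_a.device)
    mixed = lam * seg_a + (1.0 - lam) * seg_a[perm]

    queries = encoder(mixed)   # (N, D) embeddings of the mixed queries
    protos = encoder(seg_b)    # (N, D) prototype embeddings

    # Angular prototypical similarity: scaled cosine similarity plus
    # bias between every query and every prototype in the batch.
    sims = w * F.cosine_similarity(
        queries.unsqueeze(1), protos.unsqueeze(0), dim=-1) + b

    # Virtual labels: query i matches prototype i; the mixed query
    # inherits both constituent labels, weighted by lam.
    labels = torch.arange(N, device=seg_a.device)
    return (lam * F.cross_entropy(sims, labels)
            + (1.0 - lam) * F.cross_entropy(sims, labels[perm]))
```

In this reading, the regularizing effect comes from the softened virtual labels: instead of pushing every mixed query toward a single one-hot target, the loss distributes the target mass across the two mixed utterances, which discourages the embedding space from overfitting to the limited augmentations.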