10 Jun 2021

Current speaker verification models rely on supervised training with massive amounts of manually annotated data, but collecting labeled utterances from many speakers is expensive and raises privacy issues. To open up the opportunity of exploiting massive unlabeled utterance data, our work applies a contrastive self-supervised learning (CSSL) approach to the text-independent speaker verification task. The core principle of CSSL is to minimize the distance between embeddings of augmented segments truncated from the same utterance while maximizing the distance between embeddings from different utterances. We further propose a channel-invariant loss to prevent the network from encoding undesired channel information into the speaker representation. With these in mind, we conduct extensive experiments on the VoxCeleb1 and VoxCeleb2 datasets. A self-supervised thin-ResNet34 fine-tuned with only 5% of the labeled data achieves performance comparable to the fully supervised model, substantially reducing the amount of manual annotation required.

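For illustration, the sketch below shows one way the core contrastive objective described in the abstract could be implemented. It is a minimal PyTorch sketch assuming an NT-Xent-style formulation with a hypothetical `temperature` parameter; the paper's exact loss, augmentation pipeline, and the proposed channel-invariant term are not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent-style contrastive loss over a batch of utterances.

    emb_a, emb_b: (N, D) speaker embeddings of two augmented segments
    truncated from the same N utterances. Segments from the same
    utterance form positive pairs; segments from different utterances
    serve as negatives.
    """
    emb_a = F.normalize(emb_a, dim=1)
    emb_b = F.normalize(emb_b, dim=1)
    # Cosine similarity between every segment-a / segment-b pair.
    logits = emb_a @ emb_b.t() / temperature            # (N, N)
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Pull diagonal (same-utterance) pairs together, push the rest apart.
    return F.cross_entropy(logits, targets)
```

In practice the two segment views would be produced by truncating each utterance at different positions and applying independent augmentations before encoding them with the speaker network (e.g., a thin-ResNet34).
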
Chairs:
Takafumi Koshinaka
