  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 12:16
04 May 2020

The accuracy of speaker verification and diarization models depends on the quality of the speaker embeddings used to separate audio samples from different speakers. With the goal of training better embedding models, we devise an automatic pipeline for large-scale collection of speech samples from unique speakers that is significantly more automated than previous approaches. With this pipeline, we collect and publish the BookTubeSpeech dataset, containing 8,450 YouTube videos (7.74 min per video on average), each of which contains a single unique speaker. Using this dataset combined with VoxCeleb2, we show a substantial improvement in the quality of embeddings when tested on LibriSpeech, compared to a model trained on only VoxCeleb2.
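
To illustrate how speaker embeddings can be used to separate samples from different speakers, here is a minimal sketch of verification by cosine similarity. It is not the method from the talk: the function names, the 3-dimensional toy vectors, and the 0.7 threshold are all assumptions chosen for illustration. Real embeddings have far more dimensions, and thresholds are tuned on held-out data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.7):
    """Hypothetical decision rule: accept if the similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy embeddings (made up for this sketch, not produced by any model).
enroll = [0.9, 0.1, 0.3]       # enrolled speaker
test_same = [0.8, 0.2, 0.35]   # points in nearly the same direction
test_other = [-0.2, 0.9, 0.1]  # points in a very different direction

print(same_speaker(enroll, test_same))   # True
print(same_speaker(enroll, test_other))  # False
```

Under this kind of scoring, a better embedding model is one that places utterances from the same speaker close together and utterances from different speakers far apart, so that a single threshold separates them more reliably.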
