Toward Better Speaker Embeddings: Automated Collection Of Speech Samples From Unknown Distinct Speakers
Minh Pham, Zeqian Li, Jacob Whitehill
SPS
The accuracy of speaker verification and diarization models depends on the quality of the speaker embeddings used to separate audio samples from different speakers. With the goal of training better embedding models, we devise an automatic pipeline for large-scale collection of speech samples from unique speakers that is significantly more automated than previous approaches. With this pipeline, we collect and publish the BookTubeSpeech dataset, containing 8,450 YouTube videos (7.74 min per video on average), each featuring a single unique speaker. Using this dataset combined with VoxCeleb2, we show a substantial improvement in embedding quality when tested on LibriSpeech, compared to a model trained on VoxCeleb2 alone.