Training Keyword Spotters With Limited And Synthesized Speech Data

James Lin, Kevin Kilgour, Dominik Roblek, Matt Sharifi

Length: 13:32

04 May 2020

With the rise of low-power speech-enabled devices, there is a growing demand to quickly produce models for recognizing arbitrary sets of keywords. As with many machine learning tasks, one of the most challenging parts of the model creation process is obtaining a sufficient amount of training data. In this paper, we explore the effectiveness of synthesized speech data in training small spoken term detection models of around 400k parameters. Instead of training such models directly on the audio or on low-level features such as MFCCs, we use a pre-trained speech embedding model trained to extract useful features for keyword spotting models. Using this speech embedding, we show that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples. We also show that a model without our speech embeddings would need to be trained on over 4000 real examples to reach the same accuracy.
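The general idea of training only a small classifier on top of a frozen embedding can be illustrated with a toy sketch. This is not the authors' model or data: `embed` is a hypothetical stand-in (a fixed random projection) for a pre-trained speech embedding, the "audio" is synthetic noise around random per-keyword prototypes, and the softmax head is far smaller than the roughly 400k-parameter models described above. All names and sizes are assumptions chosen for illustration.

```python
import numpy as np

def embed(audio, proj):
    """Hypothetical stand-in for a frozen, pre-trained speech embedding."""
    return np.tanh(audio @ proj)

def train_head(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax classifier on fixed embedding features with gradient descent."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # cross-entropy gradient w.r.t. logits
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    return np.argmax(features @ W + b, axis=1)

rng = np.random.default_rng(0)
num_keywords, audio_dim, embed_dim = 10, 64, 32      # assumed toy sizes
proj = rng.normal(size=(audio_dim, embed_dim)) / np.sqrt(audio_dim)  # frozen, never updated
prototypes = rng.normal(size=(num_keywords, audio_dim))              # one toy pattern per keyword

def make_split(n_per_class, noise):
    """Generate noisy toy examples around each keyword prototype."""
    labels = np.repeat(np.arange(num_keywords), n_per_class)
    audio = prototypes[labels] + noise * rng.normal(size=(labels.size, audio_dim))
    return audio, labels

train_audio, train_labels = make_split(50, noise=0.3)  # plays the role of synthesized speech
test_audio, test_labels = make_split(20, noise=0.3)    # plays the role of real speech

# Only the small head is trained; the embedding stays fixed.
W, b = train_head(embed(train_audio, proj), train_labels, num_keywords)
accuracy = np.mean(predict(embed(test_audio, proj), W, b) == test_labels)
print(f"held-out accuracy: {accuracy:.2f}")
```

In this sketch the embedding weights are never updated, so the only trainable parameters are those of the head, which is what makes training feasible with few examples. The toy data says nothing about the real-versus-synthetic comparison reported in the abstract.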

Tags:

sps conference

icassp 2020 virtual conference

May 2020

icassp 2020
