UNIFIED SPECULATION, DETECTION, AND VERIFICATION KEYWORD SPOTTING

Geng-shen Fu, Thibaud Senechal, Aaron Challenner, Tao Zhang

Length: 00:14:01

11 May 2022

Accurate and timely recognition of the trigger keyword is vital for a good customer experience on smart devices. Traditional keyword spotting typically involves a trade-off between accuracy and latency: higher accuracy can be achieved by waiting for more context. In this paper, we propose a deep learning model that separates the keyword spotting task into three phases to further optimize both the accuracy and the latency of the overall system. These three phases are Speculation, Detection, and Verification. Speculation makes an early decision, which can be used to give a head start to downstream processes on the device, such as local speech recognition. Next, Detection mimics the traditional keyword trigger task and gives a more accurate decision by observing the full keyword context. Finally, Verification verifies the previous decision by observing even more audio after the keyword span. We propose a latency-aware max-pooling loss function that can train a unified model for these three tasks by tuning for different latency targets within the same model. We empirically show that the resulting unified model can accommodate these tasks with desirable performance and without requiring additional compute or memory resources.
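
The abstract does not give the loss formula, so the following is only an illustrative sketch of how a latency-aware max-pooling loss could look, not the authors' implementation. It assumes a common max-pooling formulation (a keyword example is penalized only at its highest-scoring frame inside an allowed window; a non-keyword example is penalized on every frame) and models each task's latency target as a shift of the window's end relative to the keyword end. The offsets in `LATENCY_OFFSETS` and the function names `latency_window`, `max_pooling_loss`, and `unified_loss` are hypothetical.

```python
import numpy as np

# Hypothetical latency targets, in frames relative to the end of the keyword.
# Negative: decide before the keyword ends. Positive: wait for extra audio.
LATENCY_OFFSETS = {"speculation": -20, "detection": 0, "verification": 30}


def latency_window(kw_start, kw_end, offset, num_frames):
    """Return the [start, stop) frame range in which a task may fire."""
    stop = min(max(kw_end + offset, kw_start + 1), num_frames)
    return kw_start, stop


def max_pooling_loss(posteriors, is_keyword, window=None, eps=1e-8):
    """Max-pooling cross-entropy over per-frame keyword probabilities.

    Keyword example: only the best frame inside `window` contributes.
    Non-keyword example: every frame contributes (averaged here, an assumption).
    """
    p = np.clip(np.asarray(posteriors, dtype=float), eps, 1.0 - eps)
    if is_keyword:
        start, stop = window
        return float(-np.log(p[start:stop].max()))
    return float(-np.log(1.0 - p).mean())


def unified_loss(task_posteriors, is_keyword, kw_start=None, kw_end=None):
    """Sum the per-task losses; each task uses its own latency window."""
    total = 0.0
    for task, posteriors in task_posteriors.items():
        window = None
        if is_keyword:
            window = latency_window(
                kw_start, kw_end, LATENCY_OFFSETS[task], len(posteriors)
            )
        total += max_pooling_loss(posteriors, is_keyword, window)
    return total
```

In this sketch, a model output that peaks only near the end of the keyword incurs a high Speculation loss (the peak falls outside its shortened window) but a low Detection and Verification loss, which is one way a single model with three output heads could be pushed toward three different latency targets.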

Tags:

keyword spotting

max-pooling loss

accuracy latency trade-off

crnn

multi-task learning