Improving Identification of System-Directed Speech Utterances by Deep Learning of ASR-Based Word Embeddings and Confidence Metrics

Vilayphone Vilaysouk, Amr Nour-Eldin, Dermot Connolly

Length: 00:15:47

09 Jun 2021

In this paper, we extend our previous work on the detection of system-directed speech utterances. Virtual assistants can use this binary classification to make interaction between the system and the user more natural and fluid. We explore two methods, both of which improve the equal error rate (EER) of the previous model. The first exploits the supplementary information independently captured by ASR models by integrating ASR decoder-based features as additional inputs to the final classification stage of the model, which yields a 13% relative improvement in EER. The second further integrates word embeddings into the architecture and, when combined with the first method, achieves a 48% relative improvement in EER over the baseline.
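For readers unfamiliar with the metric, the following is a minimal sketch of how an EER and a relative EER improvement might be computed for a binary system-directed vs. non-system-directed classifier. It is not the authors' evaluation code: the function names `equal_error_rate` and `relative_improvement`, the threshold sweep, and every score and EER value used here are hypothetical and chosen only for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold over the scores.

    scores: classifier outputs, higher meaning "more likely system-directed".
    labels: 1 for system-directed utterances, 0 otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also covered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])

    # False-accept rate: negatives scored at or above the threshold.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False-reject rate: positives scored below the threshold.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])

    # The EER is taken where the two error rates are closest to equal.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def relative_improvement(baseline_eer, new_eer):
    """Relative reduction in EER with respect to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical toy data, not taken from the paper.
toy_scores = [0.1, 0.6, 0.4, 0.9]
toy_labels = [0, 0, 1, 1]
toy_eer = equal_error_rate(toy_scores, toy_labels)

# Hypothetical EER values: a drop from 10% to 5.2% is a 48% relative gain.
toy_gain = relative_improvement(0.10, 0.052)
```

Under this definition, a relative improvement depends on the ratio of the two error rates rather than on their absolute difference, so the same percentage can correspond to very different absolute changes.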

Chairs:

Yifan Gong

Tags: signal processing society, IEEE ICASSP 2021, virtual conference, June 6-11 2021