Improved End-To-End Spoken Utterance Classification With A Self-Attention Acoustic Classifier
Ryan Price, Mahnoosh Mehrabani, Srinivas Bangalore
While human language provides a natural interface for human-machine communication, several challenges remain in extracting a speaker's intent when interacting with a virtual agent, especially when the speaker is in a noisy acoustic environment. In this paper, we propose a new architecture for end-to-end spoken utterance classification (SUC) and explore the impact of leveraging lexical information in conjunction with the acoustic information obtained from the end-to-end model. We demonstrate that the model achieves strong performance with acoustic features alone compared to a text classifier operating on ASR outputs. Furthermore, when the acoustic and lexical embeddings from these classifiers are combined, accuracy on par with human agents can be achieved.
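The paper's exact architecture is not given in this abstract, but a minimal sketch of the general idea it describes (self-attention pooling over acoustic frames to form an utterance-level acoustic embedding, concatenated with a lexical embedding before intent classification) might look like the following. All layer choices, dimensions, and names here are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SelfAttentionAcousticClassifier(nn.Module):
    """Hypothetical sketch: self-attention pooling over acoustic frames,
    fused with a lexical utterance embedding before classification."""

    def __init__(self, acoustic_dim=40, hidden_dim=128, lexical_dim=128, num_intents=20):
        super().__init__()
        # Frame-level acoustic encoder (architecture assumed, not from the paper).
        self.encoder = nn.GRU(acoustic_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        enc_dim = 2 * hidden_dim
        # Single-head additive self-attention scores, one per frame.
        self.attn = nn.Linear(enc_dim, 1)
        # Classifier over the concatenated acoustic + lexical embeddings.
        self.classifier = nn.Linear(enc_dim + lexical_dim, num_intents)

    def forward(self, frames, lexical_emb):
        # frames: (batch, time, acoustic_dim); lexical_emb: (batch, lexical_dim)
        enc, _ = self.encoder(frames)                   # (batch, time, enc_dim)
        weights = torch.softmax(self.attn(enc), dim=1)  # attention over time
        acoustic_emb = (weights * enc).sum(dim=1)       # pooled utterance embedding
        fused = torch.cat([acoustic_emb, lexical_emb], dim=-1)
        return self.classifier(fused)

# Usage: classify a batch of 2 utterances, each 300 frames of 40-dim features.
model = SelfAttentionAcousticClassifier()
logits = model(torch.randn(2, 300, 40), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 20])
```

In this sketch the attention weights let the pooled embedding emphasize intent-bearing frames, and the late concatenation with a lexical embedding mirrors the abstract's finding that combining acoustic and lexical information improves accuracy over either alone.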