Top-Down Attention in End-to-End Spoken Language Understanding
Yixin Chen, Weiyi Lu, Alejandro Mottini, Li Erran Li, Jasha Droppo, Zheng Du, Belinda Zeng
Spoken language understanding (SLU) is the task of inferring the semantics of spoken utterances. Traditionally, this has been achieved with a cascade of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) modules that are optimized separately, which can lead to suboptimal overall performance. Recently, End-to-End SLU (E2E SLU) was proposed to perform SLU directly from speech through joint optimization of the modules, addressing some of these shortcomings. A key challenge of this approach is how best to integrate the feature learning of the ASR and NLU sub-tasks to maximize performance. While ASR models generally focus on low-level acoustic features and NLU models require higher-level contextual information, ASR models can nonetheless leverage top-down syntactic and semantic information to improve recognition. Based on this insight, we propose Top-Down SLU (TD-SLU), a new transformer-based E2E SLU model that uses top-down attention and an attention gate to fuse high-level NLU features with low-level ASR features. We validate the model on the FluentSpeech dataset and a large internal dataset. Results show that TD-SLU outperforms the selected baselines on both ASR and NLU quality metrics, and suggest that the added high-level information improves the model's performance.
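To make the fusion idea concrete, below is a minimal PyTorch sketch of gated top-down fusion: low-level ASR frames cross-attend to high-level NLU features, and a learned gate controls how much top-down context is admitted. This is an illustrative assumption, not the authors' implementation; the module name, dimensions, and the specific gating formula (a sigmoid gate over concatenated features feeding a gated residual) are all hypothetical.

```python
import torch
import torch.nn as nn


class TopDownAttentionGate(nn.Module):
    """Illustrative sketch of top-down attention with an attention gate.

    Fuses high-level (NLU) features into low-level (ASR) features.
    Names and gating details are assumptions, not the paper's code.
    """

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # Cross-attention: ASR frames are queries; NLU tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Per-position sigmoid gate over the concatenated bottom-up and
        # top-down features decides how much top-down signal to admit.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, asr_feats: torch.Tensor, nlu_feats: torch.Tensor) -> torch.Tensor:
        # asr_feats: (batch, T_frames, d_model); nlu_feats: (batch, T_tokens, d_model)
        top_down, _ = self.cross_attn(asr_feats, nlu_feats, nlu_feats)
        g = self.gate(torch.cat([asr_feats, top_down], dim=-1))
        # Gated residual fusion of top-down context into the ASR stream.
        return self.norm(asr_feats + g * top_down)


if __name__ == "__main__":
    fuse = TopDownAttentionGate(d_model=256)
    asr = torch.randn(2, 120, 256)  # low-level acoustic frames
    nlu = torch.randn(2, 16, 256)   # high-level contextual token features
    print(fuse(asr, nlu).shape)     # torch.Size([2, 120, 256])
```

In this sketch the gate lets the model fall back to purely bottom-up recognition (gate near zero) when the top-down hypothesis is unreliable, which is one plausible reading of how gated fusion can help ASR without dominating it.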