INTEGRATION OF PRE-TRAINED NETWORKS WITH CONTINUOUS TOKEN INTERFACE FOR END-TO-END SPOKEN LANGUAGE UNDERSTANDING
Seunghyun Seo, Donghyun Kwak, Bowon Lee
Most End-to-End (E2E) Spoken Language Understanding (SLU) networks leverage pre-trained Automatic Speech Recognition (ASR) networks but still lack the capability to understand the semantics of utterances, which is crucial for the SLU task. To address this, recent studies use pre-trained Natural Language Understanding (NLU) networks. However, it is not trivial to fully utilize both pre-trained networks; many solutions have been proposed, such as Knowledge Distillation (KD), cross-modal shared embeddings, and network integration with an interface. We propose a simple and robust integration method for the E2E SLU network with a novel interface, the Continuous Token Interface (CTI). CTI is a junctional representation between the ASR and NLU networks when both networks are pre-trained with the same vocabulary. Thus, we can train our SLU network in an E2E manner without additional modules such as Gumbel-Softmax. We evaluate our model on SLURP, a challenging SLU dataset, and achieve state-of-the-art scores on the intent classification and slot filling tasks. We also verify that the NLU network, pre-trained with a Masked Language Model (MLM) objective, can utilize the noisy textual representation produced by CTI. Moreover, we train our model with additional synthetic data, SLURP-Synth, and obtain further improvements.
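The abstract does not spell out how the continuous tokens are formed, so the snippet below is only a minimal sketch of one plausible reading: because the ASR and NLU networks share a vocabulary, the ASR token posterior can be mixed with the NLU embedding table instead of being discretized (by argmax or Gumbel-Softmax), keeping the whole pipeline differentiable. The class name `ContinuousTokenInterface` and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ContinuousTokenInterface(nn.Module):
    """Hypothetical sketch of a continuous-token junction between an ASR
    decoder and an NLU encoder that share the same token vocabulary.

    Rather than committing to discrete tokens, the ASR token posterior is
    used to form a soft mixture of the NLU embedding vectors, so gradients
    can flow end to end without a Gumbel-Softmax relaxation.
    """

    def __init__(self, nlu_token_embedding: nn.Embedding):
        super().__init__()
        # Reuse the (assumed pre-trained) NLU embedding table: (vocab, hidden).
        self.nlu_token_embedding = nlu_token_embedding

    def forward(self, asr_logits: torch.Tensor) -> torch.Tensor:
        # asr_logits: (batch, seq_len, vocab) from the ASR decoder.
        token_posterior = asr_logits.softmax(dim=-1)
        # Expected NLU embedding under the ASR posterior:
        # (batch, seq_len, vocab) @ (vocab, hidden) -> (batch, seq_len, hidden)
        return token_posterior @ self.nlu_token_embedding.weight


# Toy usage with random tensors (shapes only; no pre-trained weights).
vocab_size, hidden_dim = 1000, 256
interface = ContinuousTokenInterface(nn.Embedding(vocab_size, hidden_dim))
asr_logits = torch.randn(2, 7, vocab_size)      # stand-in ASR decoder output
continuous_tokens = interface(asr_logits)       # (2, 7, 256), fed to the NLU encoder
print(continuous_tokens.shape)
```

Under this reading, the NLU network receives a "noisy" but continuous version of its usual token embeddings, which is consistent with the claim that MLM pre-training helps it cope with imperfect ASR output.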