End-To-End Spoken Language Understanding Without Matched Language Speech Model Pretraining Data
Ryan Price
In contrast to conventional approaches to spoken language understanding (SLU) that cascade a speech recognizer with a natural language understanding component, end-to-end (E2E) approaches for SLU infer semantics directly from the speech signal without processing it through separate subsystems. Pretraining part of the E2E model for speech recognition before finetuning the entire model on the target SLU task has proven to be an effective way to address the increased data requirements of E2E SLU models. However, transcribed corpora in the target language and domain may not always be available for pretraining an E2E SLU model. This paper proposes two strategies to improve the performance of E2E SLU models when transcribed data for pretraining in the target language is unavailable: multilingual pretraining with mismatched languages and data augmentation using SpecAugment. We demonstrate the effectiveness of these two methods for E2E SLU on two datasets, including one recently released publicly available dataset where we surpass the best previously published results despite not using any matched language data for pretraining.
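As a rough illustration of the second strategy, SpecAugment applies random frequency and time masks directly to the acoustic features before training. The sketch below, in NumPy, assumes log-mel features shaped (time, freq); the function name, mask counts, and mask widths are illustrative choices, not the paper's settings.

```python
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, max_freq_width=15,
                 num_time_masks=2, max_time_width=35, rng=None):
    """SpecAugment-style masking on a (time, freq) log-mel spectrogram.

    Masked regions are set to zero, a common simplification when the
    features are mean-normalized. All parameter values are illustrative.
    """
    rng = rng or np.random.default_rng()
    augmented = log_mel.copy()
    num_frames, num_bins = augmented.shape

    # Frequency masking: zero out random contiguous bands of mel bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, num_bins - width)))
        augmented[:, start:start + width] = 0.0

    # Time masking: zero out random contiguous spans of frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, num_frames - width)))
        augmented[start:start + width, :] = 0.0

    return augmented

# Example: augment a random 300-frame, 80-bin log-mel spectrogram.
features = np.random.randn(300, 80).astype(np.float32)
augmented = spec_augment(features)
```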