TIE YOUR EMBEDDINGS DOWN: CROSS-MODAL LATENT SPACES FOR END-TO-END SPOKEN LANGUAGE UNDERSTANDING

Bhuvan Agrawal, Markus Müller, Samridhi Choudhary, Martin Radfar, Athanasios Mouchtaris, Ross McGowan, Nathan Susanj, Siegfried Kunzmann

SPS

Length: 00:05:50

10 May 2022

End-to-end (E2E) spoken language understanding (SLU) systems infer the semantics of a spoken utterance directly from the audio signal. Training such a system remains challenging, however, largely because paired audio-semantics data is scarce. In this paper, we treat an E2E system as a multi-modal model whose two modalities are audio and text, and use a cross-modal latent space (CMLS) architecture in which a shared latent space is learned between the 'acoustic' and 'text' embeddings. We propose using different multi-modal losses to explicitly align the acoustic embeddings with the text embeddings (obtained from a semantically powerful pre-trained BERT model) in the latent space. We train the CMLS model on two publicly available E2E datasets and one internal dataset, comparing several cross-modal losses. Our proposed triplet loss function achieves the best performance: a relative improvement of 22.1% over an E2E model without a cross-modal space and a relative improvement of 2.8% over a previously published CMLS model using L2 loss on our internal dataset.
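The two alignment objectives named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact loss formulation, the margin, the distance measure, or how negatives are chosen. The code below assumes the standard margin-based triplet loss with Euclidean distance, with the acoustic embedding as the anchor, the text embedding of the same utterance as the positive, and the text embedding of a different utterance as the negative. All function names and the default margin are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_alignment_loss(acoustic, text):
    """Squared L2 distance between an acoustic and a text embedding.

    Assumed form of an L2 alignment loss: it only pulls a matched
    pair together and has no notion of mismatched pairs.
    """
    return l2_distance(acoustic, text) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (assumed formulation).

    anchor:   acoustic embedding of an utterance
    positive: text embedding of the same utterance
    negative: text embedding of a different utterance
    margin:   hypothetical default; the abstract does not state one

    The loss is zero once the anchor is closer to the positive than
    to the negative by at least `margin`.
    """
    d_pos = l2_distance(anchor, positive)
    d_neg = l2_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Under these assumptions, the L2 loss only shrinks the distance of matched acoustic-text pairs, while the triplet loss also pushes the acoustic embedding away from the text embedding of a mismatched utterance.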

Tags:

signal to interpretation

end-to-end neural model

cross-modal learning

spoken language understanding

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
