Recurrent Neural Audiovisual Word Embeddings for Synchronized Speech and Real-Time MRI
Murat Saraçlar, Öykü Deniz Köse
This paper addresses word embeddings for word segments found in audio and real-time magnetic resonance imaging (rtMRI) videos. The embeddings are created so that the segments can be stored and retrieved efficiently, and how well they represent the original data is evaluated with the same-different word-discrimination task, defined for both unimodal and cross-view settings. For the unimodal setting, a Siamese neural network is designed to create word embeddings for the two data modalities independently; for the rtMRI videos, the inputs to this network are generated by a correspondence autoencoder. For the cross-view setting, a recurrent neural network (RNN) that takes data from both modalities as input is trained to generate embeddings jointly for the two data sources. The choice of objective function for the RNN is also investigated. On the USC-TIMIT rtMRI dataset, the proposed embeddings outperform the conventional dynamic time warping (DTW) baseline by a clear margin. These outcomes show that the proposed word embeddings can be a step towards faster unimodal and cross-view query-by-example search.
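As a rough illustration of the kind of pipeline the abstract describes, the sketch below implements a recurrent embedding network trained with a triplet margin loss, one common objective for Siamese word-embedding models. The layer sizes, embedding dimension, and margin are illustrative assumptions, not the paper's configuration; in the cross-view setting, the anchor would come from one modality (e.g. acoustic frames) and the positive and negative examples from the other (e.g. correspondence-autoencoder features of rtMRI frames).

```python
# Minimal sketch of a recurrent word embedder with a cosine triplet loss.
# All hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentWordEmbedder(nn.Module):
    """Maps a variable-length feature sequence (e.g. acoustic frames, or
    correspondence-autoencoder features of rtMRI frames) to a single
    fixed-dimensional word embedding."""
    def __init__(self, input_dim: int, hidden_dim: int = 256, embed_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim). Concatenate the last layer's final
        # forward and backward hidden states, then project and normalize.
        _, h = self.rnn(x)
        h = torch.cat([h[-2], h[-1]], dim=-1)
        return F.normalize(self.proj(h), dim=-1)

def triplet_margin_loss(anchor: torch.Tensor, positive: torch.Tensor,
                        negative: torch.Tensor, margin: float = 0.4) -> torch.Tensor:
    """Pull same-word pairs together and push different-word pairs at
    least `margin` further apart, in cosine distance."""
    pos = 1.0 - F.cosine_similarity(anchor, positive)
    neg = 1.0 - F.cosine_similarity(anchor, negative)
    return F.relu(margin + pos - neg).mean()
```

The same-different word-discrimination task used for evaluation is typically scored as an average precision over all embedding pairs ranked by cosine distance; the helper below is a standard way to compute this score, not code from the paper.

```python
# Sketch: average precision for the same-different word-discrimination task.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import average_precision_score

def same_different_ap(embeddings: np.ndarray, labels: np.ndarray) -> float:
    """Rank all embedding pairs by cosine distance and measure how cleanly
    same-word pairs separate from different-word pairs."""
    dists = pdist(embeddings, metric="cosine")      # condensed i<j pair list
    iu = np.triu_indices(len(labels), k=1)          # same i<j pair ordering
    same = (labels[:, None] == labels[None, :])[iu].astype(int)
    return average_precision_score(same, -dists)   # smaller distance = more similar
```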