
Trilingual Semantic Embeddings of Visually Grounded Speech with Self-Attention Mechanisms

Yasunori Ohishi, Takahito Kawanishi, Akisato Kimura, David Harwath, James Glass, Kunio Kashino

Length: 12:57
04 May 2020

We propose a trilingual semantic embedding model that associates visual objects in images with segments of speech signals corresponding to spoken words in an unsupervised manner. Unlike existing models, our model incorporates three different languages, namely, English, Hindi, and Japanese. To build the model, we used the existing English and Hindi datasets and collected a new corpus of Japanese speech captions. These spoken captions are spontaneous descriptions by individual speakers, rather than readings based on prepared transcripts. Therefore, we introduce a self-attention mechanism into the model to better map the spoken captions associated with the same image into the embedding space. We expect the self-attention mechanism to efficiently capture relationships between widely separated word-like segments. Experimental results show that the introduction of a third language improves the average performance in terms of cross-modal and cross-lingual retrieval accuracy, and that the self-attention mechanism added to the model works effectively.
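To illustrate the general idea, the sketch below shows one plausible way a self-attention layer can pool variable-length speech features into a single caption embedding that is compared against an image embedding in a shared space. This is a minimal, hypothetical example, not the authors' implementation: the class name SpeechCaptionEncoder and the dimensions (feat_dim, embed_dim, num_heads) are assumptions made for illustration only.

```python
# Hypothetical sketch (not the authors' code): self-attention pooling of a
# spoken caption into a fixed-size embedding for cross-modal retrieval.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechCaptionEncoder(nn.Module):
    def __init__(self, feat_dim=40, embed_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)            # frame-level projection
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.pool = nn.Linear(embed_dim, 1)                    # attention-pooling scores

    def forward(self, frames):                                 # frames: (B, T, feat_dim)
        x = self.proj(frames)                                  # (B, T, embed_dim)
        x, _ = self.attn(x, x, x)                              # self-attention across word-like segments
        w = torch.softmax(self.pool(x), dim=1)                 # (B, T, 1) pooling weights
        emb = (w * x).sum(dim=1)                               # (B, embed_dim)
        return F.normalize(emb, dim=-1)                        # unit-norm caption embedding

def similarity(image_emb, caption_emb):
    # Dot-product similarity between image and caption embeddings; a margin-based
    # loss could then pull captions in any language toward their paired image.
    return image_emb @ caption_emb.t()
```

In such a setup, captions in English, Hindi, and Japanese would each be encoded into the same embedding space, so retrieval can be performed both across modalities (speech to image) and across languages (speech to speech).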
