Continuous Action Space-based Spoken Language Acquisition Agent Using Residual Sentence Embedding and Transformer Decoder

Ryota Komatsu (Tokyo Institute of Technology); Yusuke Kimura (Tokyo Institute of Technology); Takuma Okamoto (National Institute of Information and Communications Technology); Takahiro Shinozaki (Tokyo Institute of Technology)

08 Jun 2023

Studies on spoken language acquisition agents aim to understand the mechanism of human language learning and to reproduce it on computers. Existing open-vocabulary agents first perform unsupervised word learning from speech signals to construct a word dictionary that serves as a discrete action space, and then use reinforcement learning to learn how to use the words in the dictionary through interaction with dialogue partners. A limitation of these agents is that they have difficulty pronouncing multi-word utterances. This study proposes an agent that generates multi-word waveform utterances using a continuous action space. The conventional agent uses a vision-focusing mechanism to accelerate dialogue-based learning by guiding the agent's attention to the concepts in its field of view. The proposed agent replaces this mechanism with a residual sentence embedding combined with vision features, which is used as the action space. The agent consists of speech and image input front-ends, a transformer language model of pseudo-phones, and a speech synthesizer that generates output waveform utterances. It uses the deterministic policy gradient instead of Q-learning so that it can operate on the continuous action space. Experimental results show that the agent learns multi-word utterances with the help of unsupervised learning algorithms that use unlabeled speech and image data sets.
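
The abstract gives no implementation details, so the following is only a toy sketch of why a deterministic policy gradient suits a continuous action space: the actor outputs an action vector directly, and it is improved by following the critic's gradient with respect to the action rather than by taking a maximum over a discrete set of actions as in Q-learning. Everything in it is an assumption made for illustration and is not taken from the paper: the one-state problem, the quadratic reward, the linear critic, the function name `run_dpg`, and all hyperparameters.

```python
import numpy as np


def run_dpg(target, steps=4000, sigma=0.3, lr_actor=0.005, lr_critic=0.05, seed=0):
    """Toy deterministic policy gradient on a hypothetical one-state problem.

    The "policy" is a single continuous action vector theta, and the assumed
    reward is -||action - target||^2. This is NOT the agent from the paper.
    """
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    dim = target.size

    theta = np.zeros(dim)  # deterministic policy output (the continuous action)
    w0 = 0.0               # critic intercept: estimated value near theta
    w = np.zeros(dim)      # critic slope per unit of exploration noise

    for _ in range(steps):
        # Explore around the deterministic action with Gaussian noise.
        z = rng.standard_normal(dim)
        action = theta + sigma * z
        reward = -np.sum((action - target) ** 2)

        # Critic: local linear model Q(a) ~ w0 + w . z, with z = (a - theta) / sigma,
        # fitted to the observed reward by a least-mean-squares step.
        error = reward - (w0 + w @ z)
        w0 += lr_critic * error
        w += lr_critic * error * z

        # Actor: ascend dQ/da at a = theta, which is w / sigma for this critic.
        theta += lr_actor * (w / sigma)

    return theta


if __name__ == "__main__":
    learned = run_dpg([1.0, -0.5, 0.8])
    print("learned action:", np.round(learned, 2))
```

In this sketch no action is ever enumerated or compared against alternatives; the action vector simply moves along the critic's gradient, which is what makes the approach usable when actions are points in an embedding space rather than dictionary entries.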

Tags:

Machine learning methods for language

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
