Continuous Action Space-based Spoken Language Acquisition Agent Using Residual Sentence Embedding and Transformer Decoder
Ryota Komatsu (Tokyo Institute of Technology); Yusuke Kimura (Tokyo Institute of Technology); Takuma Okamoto (National Institute of Information and Communications Technology); Takahiro Shinozaki (Tokyo Institute of Technology)
SPS
Studies on spoken language acquisition agents aim to understand the mechanism of human language learning and to realize it on computers. Existing open-vocabulary agents first perform unsupervised word learning from speech signals to construct a word dictionary as a discrete action space, and then conduct reinforcement learning to learn the use of the words in the dictionary through interaction with dialogue partners. A limitation is that they have difficulty producing multi-word utterances. This study proposes an agent that generates multi-word waveform utterances using a continuous action space. The conventional agent uses a vision-focusing mechanism to accelerate dialogue-based learning by guiding the agent's attention to concepts in its field of view. In contrast, the proposed agent replaces this mechanism with a residual sentence embedding combined with vision features, which serves as the action space. The agent consists of speech and image input front-ends, a transformer language model over pseudo-phones, and a speech synthesizer that generates output waveform utterances, and it uses the deterministic policy gradient instead of Q-learning to operate on the continuous action space. Experimental results show that the agent learns multi-word utterances, assisted by unsupervised learning algorithms trained on unlabeled speech and image data sets.
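To make the learning mechanism concrete, the following is a minimal sketch (not the authors' implementation) of a deterministic policy gradient update on a continuous action space. It assumes a state built from vision features plus a residual sentence embedding, a linear actor, and, for illustration only, a fixed quadratic critic whose maximizing action is defined by a hypothetical matrix `M`; in the actual agent, the critic would itself be learned from interaction rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

state_dim, action_dim = 8, 4
# Hypothetical target-action matrix: defines the critic's optimum (unknown to the actor).
M = rng.normal(size=(action_dim, state_dim))
# Linear actor parameters: mu(s) = W @ s, initialized to zero.
W = np.zeros((action_dim, state_dim))

def actor(s):
    # Deterministic policy: maps a state to a continuous action.
    return W @ s

def q_value(s, a):
    # Illustrative critic: peaks at the target action M @ s.
    return -np.sum((a - M @ s) ** 2)

def grad_q_wrt_a(s, a):
    # dQ/da, the quantity the deterministic policy gradient backpropagates
    # through the actor.
    return -2.0 * (a - M @ s)

lr = 0.05
states = rng.normal(size=(256, state_dim))
for _ in range(200):
    for s in states:
        a = actor(s)
        # DPG chain rule: dJ/dW = dQ/da * da/dW; for a linear actor,
        # da/dW is the outer product of the gradient with the state.
        W += lr * np.outer(grad_q_wrt_a(s, a), s) / len(states)

avg_q = float(np.mean([q_value(s, actor(s)) for s in states]))
print(avg_q)
```

Because the policy is deterministic, the actor is improved by ascending the critic's gradient with respect to the action, which is what lets the agent search a continuous embedding space rather than select from a discrete word dictionary.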