Multistate Encoding with End-to-End Speech RNN Transducer Network

Zelin Wu, Bo Li, Yu Zhang, Petar Aleksic, Tara Sainath

Length: 09:47

04 May 2020

Recurrent Neural Network Transducer (RNN-T) models [1] provide high-accuracy automatic speech recognition (ASR). Such end-to-end (E2E) models combine the acoustic, pronunciation, and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size. In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique reduces Word Error Rate (WER) by up to 10.4% relative on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently for improved overall quality.
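As a rough illustration of what encoding a contextual state could look like, the sketch below one-hot encodes a discrete state and appends it to every acoustic feature frame. The abstract does not say which encoding methods were explored or where in the network the context is injected, so this is only one plausible option and not the authors' method. The state names in STATES, the helper names, and the toy numbers are all hypothetical. A small helper also shows how a relative WER reduction, the metric quoted above, is computed.

```python
from typing import List, Sequence

# Hypothetical state inventory. The abstract mentions device state and dialog
# state as examples of context but does not list concrete values.
STATES = ["idle", "media_playing", "alarm_ringing", "awaiting_confirmation"]


def one_hot(state: str, states: Sequence[str] = STATES) -> List[float]:
    """Return a one-hot vector for a known state, or all zeros if unknown."""
    return [1.0 if s == state else 0.0 for s in states]


def append_context(frames: Sequence[Sequence[float]], state: str) -> List[List[float]]:
    """Concatenate the same state vector to every acoustic frame.

    Assumption: context is injected at the encoder input. Other injection
    points are possible and are not covered here.
    """
    ctx = one_hot(state)
    return [list(frame) + ctx for frame in frames]


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Two toy 3-dimensional frames; real acoustic features are much larger.
    frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    augmented = append_context(frames, "media_playing")
    print(len(augmented[0]))  # 3 acoustic dims + 4 state dims

    # Made-up WER values, used only to show the arithmetic.
    print(relative_wer_reduction(10.0, 9.0))
```

Note that a relative reduction is measured against the baseline WER, so a drop from 10.0% to 9.0% WER counts as a 10% relative reduction even though the absolute change is one point.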

Tags:

sps conference

icassp 2020 virtual conference

May 2020

icassp 2020
