
Multi-output RNN-T Joint Networks for Multi-task Learning of ASR and Auxiliary Tasks

Weiran Wang (Google); Ding Zhao (Google); Shaojin Ding (Google); Hao Zhang (Google); Shuo-yiin Chang (Google); David Rybach (Google); Tara Sainath (Google); Yanzhang He (Google); Ian McGraw; Shankar Kumar (Google)

06 Jun 2023

We propose a multi-output joint network architecture for the RNN-T transducer, enabling multi-task modeling of ASR together with auxiliary tasks that rely on ASR outputs. Each output of the joint network predicts target labels from a disjoint vocabulary for its task, while all tasks share the audio features produced by the encoder and the language model features produced by the prediction network. Each task is trained with an RNN-T loss that marginalizes over all possible alignment paths, and multiple tasks can share the blank logit so that their outputs are synchronized. We demonstrate our method on two auxiliary tasks, capitalization and pause prediction, and discuss the modeling and inference considerations for each. For capitalization, we successfully distill capitalization labels from a stand-alone text normalization model and achieve a competitive Uppercase Error Rate (UER) while offering streaming capability and improved inference efficiency. In addition, our model matches the capitalization accuracy of a mixed-case ASR model, but obtains improved WERs when integrated with external language models. For pause prediction, we achieve the same performance as the previous two-step approach while providing a simpler training recipe, without affecting ASR accuracy.
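The architecture in the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the dimensions, the tanh joint combination, and the task names (`asr`, `caps`) are illustrative assumptions. The sketch shows the two key ideas stated above: per-task output heads over disjoint vocabularies on top of shared encoder and prediction-network features, and a single blank logit appended to every task's output so the tasks stay synchronized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
D_ENC, D_PRED, D_JOINT = 8, 8, 16
VOCABS = {"asr": 5, "caps": 2}  # disjoint label vocabularies per task

# Shared joint projections, per-task output heads, and one shared blank logit.
W_enc = rng.normal(size=(D_ENC, D_JOINT))
W_pred = rng.normal(size=(D_PRED, D_JOINT))
heads = {task: rng.normal(size=(D_JOINT, v)) for task, v in VOCABS.items()}
w_blank = rng.normal(size=(D_JOINT,))  # shared across all tasks

def multi_output_joint(enc_t, pred_u):
    """Per-task logits [labels..., blank] for one (t, u) point of the RNN-T grid."""
    # Shared joint features from encoder and prediction-network outputs.
    h = np.tanh(enc_t @ W_enc + pred_u @ W_pred)
    blank = h @ w_blank  # single blank logit, appended to every task's output
    return {task: np.concatenate([h @ W, [blank]]) for task, W in heads.items()}

out = multi_output_joint(rng.normal(size=D_ENC), rng.normal(size=D_PRED))
# Each task predicts over its own vocabulary plus the shared blank symbol.
assert out["asr"].shape == (VOCABS["asr"] + 1,)
assert out["caps"].shape == (VOCABS["caps"] + 1,)
assert np.isclose(out["asr"][-1], out["caps"][-1])  # blank logit is shared
```

In training, each task's logits would feed its own RNN-T loss; because the blank logit is tied, the tasks agree on when to emit versus advance in time, which is what keeps their label streams aligned.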
