Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR

Xilai Li (Amazon); Goeric Huybrechts (Amazon); Srikanth Ronanki (Amazon); Jeff Farris (Amazon); Sravan Babu Bodapati (Amazon)

07 Jun 2023

Recently, there has been increasing interest in unifying streaming and non-streaming speech recognition models to reduce development, training, and deployment costs. The best-known approaches rely on either window-based or dynamic chunk-based attention strategies, together with causal convolutions, to minimize the degradation due to streaming. However, the performance gap between non-streaming and a full-contextual model trained independently remains relatively large. To address this, we propose a dynamic chunk-based convolution that replaces the causal convolution in a hybrid Connectionist Temporal Classification (CTC)-Attention Conformer architecture. Additionally, we demonstrate further improvements by initializing weights from a full-contextual model and by parallelizing the convolution and self-attention modules. We evaluate our models on the open-source VoxPopuli and LibriSpeech datasets and on in-house conversational datasets. Overall, our proposed model reduces the degradation of the streaming mode relative to the non-streaming full-contextual model from 41.7% and 45.7% to 16.7% and 26.2% on the LibriSpeech test-clean and test-other datasets, respectively, while achieving a 15.5% relative word error rate (WER) improvement over the previous state-of-the-art unified model.
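To make the idea concrete, here is a minimal sketch of a chunk-based convolution for a single channel, written in plain Python. It rests on an assumption the abstract does not spell out: that each frame may see any left context plus future frames up to the right edge of its own chunk, and nothing beyond. The function name `chunk_conv1d`, its parameters, and the zero-padding at the sequence edges are hypothetical choices for illustration, not the authors' implementation.

```python
def chunk_conv1d(frames, kernel, chunk_size):
    """Single-channel 1-D convolution with chunk-limited look-ahead.

    Assumption (not stated in the abstract): a frame at position t may use
    any earlier frame, and later frames only up to the end of the chunk
    that contains t. Positions outside the sequence count as zero.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd so it has a center tap")

    half = len(kernel) // 2
    num_frames = len(frames)
    output = []
    for t in range(num_frames):
        # Exclusive right boundary of the chunk containing frame t.
        chunk_end = min((t // chunk_size + 1) * chunk_size, num_frames)
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t + k - half
            # Skip taps before the sequence start or past the chunk boundary.
            if 0 <= idx < chunk_end:
                acc += weight * frames[idx]
        output.append(acc)
    return output


frames = [1, 2, 3, 4]
kernel = [1, 1, 1]

full_context = chunk_conv1d(frames, kernel, chunk_size=4)  # [3.0, 6.0, 9.0, 7.0]
two_frame_chunks = chunk_conv1d(frames, kernel, chunk_size=2)  # [3.0, 3.0, 9.0, 7.0]
causal = chunk_conv1d(frames, kernel, chunk_size=1)  # [1.0, 3.0, 5.0, 7.0]
```

With `chunk_size` equal to 1 the sketch reduces to a causal convolution, and with a chunk that covers the whole utterance it matches an ordinary full-context convolution. Presumably the "dynamic" part varies the chunk size during training so that one model covers both regimes, though the abstract does not give the sampling scheme.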

Tags: Resource constrained speech recognition

Venue: IEEE ICASSP 2023, 4-10 June 2023, Greece (virtual and in-person conference)
