Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation

Stefan Braun (Apple); Erik McDermott (Apple); Roger Hsiao (Apple)

06 Jun 2023

The neural transducer is an end-to-end model for automatic speech recognition (ASR). While the model is well-suited for streaming ASR, the training process remains challenging. During training, the memory requirements may quickly exceed the capacity of state-of-the-art GPUs, limiting batch size and sequence lengths. In this work, we analyze the time and space complexity of a typical transducer training setup. We propose a memory-efficient training method that computes the transducer loss and gradients sample by sample. We present optimizations to increase the efficiency and parallelism of the sample-wise method. In a set of thorough benchmarks, we show that our sample-wise method significantly reduces memory usage and performs at competitive speed compared to the default batched computation. As a highlight, we compute the transducer loss and gradients for a batch size of 1024 and an audio length of 40 seconds using only 6 GB of memory.
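To make the memory argument concrete, the following back-of-the-envelope sketch compares the size of the joint network output when it is materialized for a whole batch versus one sample at a time. It relies on the standard transducer formulation, in which the joint output has shape [batch, frames, labels + 1, vocabulary]. Only the batch size (1024) and audio length (40 seconds) come from the abstract; the encoder frame rate, label sequence length, vocabulary size, and float32 storage are illustrative assumptions, and the helper `joint_tensor_bytes` is hypothetical. The sketch counts a single tensor only and is not meant to reproduce the 6 GB figure reported above.

```python
def joint_tensor_bytes(batch, frames, labels, vocab, bytes_per_elem=4):
    """Bytes needed for a joint output of shape [batch, frames, labels + 1, vocab]."""
    return batch * frames * (labels + 1) * vocab * bytes_per_elem


# Values taken from the abstract.
BATCH = 1024
AUDIO_SECONDS = 40

# Assumptions for illustration only (not from the paper).
FRAMES_PER_SECOND = 25   # assumed encoder output rate
LABELS = 100             # assumed label sequence length per utterance
VOCAB = 1024             # assumed output vocabulary size

frames = AUDIO_SECONDS * FRAMES_PER_SECOND

# Batched computation materializes the joint output for all samples at once.
batched_bytes = joint_tensor_bytes(BATCH, frames, LABELS, VOCAB)

# Sample-wise computation only needs one sample's joint output at a time.
sample_bytes = joint_tensor_bytes(1, frames, LABELS, VOCAB)

GIB = 1024 ** 3
print(f"batched joint output:     {batched_bytes / GIB:.1f} GiB")
print(f"sample-wise joint output: {sample_bytes / GIB:.2f} GiB")
```

Under these assumptions the peak size of this tensor shrinks by a factor equal to the batch size, which is why processing samples one at a time can keep memory bounded even for large batches. The trade-off is reduced parallelism across samples, which the optimizations mentioned in the abstract are intended to address.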

Tags:

Word spotting, VAD, and other topics in speech recognition

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
