Efficient Domain Adaptation for Speech Foundation Models
Bo Li (Google); Dongseong Hwang (Google); Zhouyuan Huo (Google); Junwen Bai (Google); Guru Prakash Arumugam (Google); Tara Sainath (Google); Khe C Sim (Google); Yu Zhang (Google); Wei Han (Google); Trevor Strohman (Google); Françoise Beaufays (Google)
Foundation models (FMs), which are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have attracted significant interest in the research community. Benefiting from diverse data sources spanning different modalities, languages, and application domains, foundation models have demonstrated strong generalization and knowledge-transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and extend the joint training strategy JUST Hydra for finetuning with both source-domain and unsupervised target-domain data. The FM encoder adapters and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model-parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to a 731.1M-parameter model trained from scratch on an additional 300M supervised in-domain data.
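To illustrate the parameter-efficient finetuning step described above, the following is a minimal sketch of a generic residual bottleneck adapter in PyTorch. It does not reproduce the paper's BEST-RQ or JUST Hydra components; the module names, dimensions, and helper function are hypothetical, and the adapter design follows the common Houlsby-style pattern of freezing the pretrained encoder while training only small adapter projections and the decoder.

```python
import torch
import torch.nn as nn


class ResidualAdapter(nn.Module):
    """Hypothetical bottleneck adapter inserted after a frozen encoder layer.

    Only these small projections are trained during domain adaptation;
    the foundation-model encoder weights stay frozen.
    """

    def __init__(self, d_model: int = 512, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, d_model)    # project back up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen encoder's features,
        # so the adapter only learns a small in-domain correction.
        return x + self.up(torch.relu(self.down(self.norm(x))))


def adaptation_parameters(encoder: nn.Module,
                          adapters: nn.ModuleList,
                          decoder: nn.Module):
    """Freeze the FM encoder; return only adapter and decoder parameters
    for the optimizer, keeping the finetuned parameter count small."""
    encoder.requires_grad_(False)
    return list(adapters.parameters()) + list(decoder.parameters())
```

In this sketch, only the adapter and decoder parameters are passed to the optimizer, which is how the finetuned parameter count can stay far below the full model size, consistent with the 130.8M-versus-731.1M comparison reported in the abstract.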