Unsupervised Pre-Training Of Bidirectional Speech Encoders Via Masked Reconstruction
Weiran Wang, Qingming Tang, Karen Livescu
We propose an approach for pre-training speech representations via a masked reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be used directly in typical bidirectional speech recognition models. The pre-trained networks can then be fine-tuned on a smaller amount of supervised data for speech recognition. Experiments with this approach on the LibriSpeech and Wall Street Journal corpora show promising results, with about 15% relative improvements in word error rate over a typical baseline speech recognizer. We find that the main factors that lead to speech recognition improvements are: masking segments of sufficient width in both time and frequency, pre-training on a much larger amount of unlabeled data than the labeled data, and domain adaptation when the unlabeled and labeled data come from different domains. The gain from pre-training is additive to that of supervised data augmentation.
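To make the pre-training objective concrete, the sketch below illustrates a masked reconstruction loss on spectrogram-like features with a bidirectional recurrent encoder. This is a minimal illustration, not the paper's implementation: the feature dimension, mask widths and counts, LSTM architecture, and all function names (apply_masks, masked_reconstruction_loss, BidirectionalEncoder) are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

# Hypothetical hyperparameters (not taken from the paper): widths and counts
# of the time and frequency masks applied to each utterance.
TIME_MASK_WIDTH = 20   # frames per time mask
FREQ_MASK_WIDTH = 8    # mel bins per frequency mask
NUM_TIME_MASKS = 2
NUM_FREQ_MASKS = 2


def apply_masks(features):
    """Zero out random time and frequency blocks of a (T, F) feature matrix.

    Returns the masked features and a boolean mask of the zeroed entries,
    so the reconstruction loss can be restricted to masked positions.
    """
    T, F = features.shape
    mask = torch.zeros(T, F, dtype=torch.bool)
    for _ in range(NUM_TIME_MASKS):
        t0 = torch.randint(0, max(T - TIME_MASK_WIDTH, 1), (1,)).item()
        mask[t0:t0 + TIME_MASK_WIDTH, :] = True
    for _ in range(NUM_FREQ_MASKS):
        f0 = torch.randint(0, max(F - FREQ_MASK_WIDTH, 1), (1,)).item()
        mask[:, f0:f0 + FREQ_MASK_WIDTH] = True
    masked = features.clone()
    masked[mask] = 0.0
    return masked, mask


class BidirectionalEncoder(nn.Module):
    """Bidirectional LSTM encoder with a linear reconstruction head."""

    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x):              # x: (batch, T, feat_dim)
        h, _ = self.rnn(x)
        return self.head(h)            # reconstruction: (batch, T, feat_dim)


def masked_reconstruction_loss(model, features):
    """Mean squared reconstruction error, computed only on masked positions."""
    masked, mask = apply_masks(features)
    recon = model(masked.unsqueeze(0)).squeeze(0)
    return ((recon - features)[mask] ** 2).mean()


# Example: one pre-training step on a random 80-dim log-mel "utterance".
model = BidirectionalEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
utterance = torch.randn(300, 80)       # 300 frames, 80 mel bins
optimizer.zero_grad()
loss = masked_reconstruction_loss(model, utterance)
loss.backward()
optimizer.step()
```

Because the encoder is bidirectional and operates on full utterances, the same network can be fine-tuned directly inside a standard bidirectional speech recognizer after pre-training, which is the usage pattern the abstract describes.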