Unsupervised Pre-Training Of Bidirectional Speech Encoders Via Masked Reconstruction
Weiran Wang, Qingming Tang, Karen Livescu
We propose an approach for pre-training speech representations via a masked reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be used directly in typical bidirectional speech recognition models. The pre-trained networks can then be fine-tuned on a smaller amount of supervised data for speech recognition. Experiments with this approach on the LibriSpeech and Wall Street Journal corpora show promising results, with about 15% relative improvements in word error rate over a typical baseline speech recognizer. We find that the main factors that lead to speech recognition improvements are: masking segments of sufficient width in both time and frequency, pre-training on a much larger amount of unlabeled data than the labeled data, and domain adaptation when the unlabeled and labeled data come from different domains. The gain from pre-training is additive to that of supervised data augmentation.
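To make the pre-training objective concrete, the sketch below illustrates a masked reconstruction loss on spectrogram-like features with a bidirectional recurrent encoder. This is a minimal illustration, not the paper's implementation: the feature dimension, mask widths and counts, LSTM architecture, and all function names (apply_masks, masked_reconstruction_loss, BidirectionalEncoder) are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

# Hypothetical hyperparameters (not taken from the paper): widths and counts
# of the time and frequency masks applied to each utterance.
TIME_MASK_WIDTH = 20   # frames per time mask
FREQ_MASK_WIDTH = 8    # mel bins per frequency mask
NUM_TIME_MASKS = 2
NUM_FREQ_MASKS = 2


def apply_masks(features):
    """Zero out random time and frequency blocks of a (T, F) feature matrix.

    Returns the masked features and a boolean mask of the zeroed entries,
    so the reconstruction loss can be restricted to masked positions.
    """
    T, F = features.shape
    mask = torch.zeros(T, F, dtype=torch.bool)
    for _ in range(NUM_TIME_MASKS):
        t0 = torch.randint(0, max(T - TIME_MASK_WIDTH, 1), (1,)).item()
        mask[t0:t0 + TIME_MASK_WIDTH, :] = True
    for _ in range(NUM_FREQ_MASKS):
        f0 = torch.randint(0, max(F - FREQ_MASK_WIDTH, 1), (1,)).item()
        mask[:, f0:f0 + FREQ_MASK_WIDTH] = True
    masked = features.clone()
    masked[mask] = 0.0
    return masked, mask


class BidirectionalEncoder(nn.Module):
    """Bidirectional LSTM encoder with a linear reconstruction head."""

    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, feat_dim)

    def forward(self, x):              # x: (batch, T, feat_dim)
        h, _ = self.rnn(x)
        return self.head(h)            # reconstruction: (batch, T, feat_dim)


def masked_reconstruction_loss(model, features):
    """Mean squared reconstruction error, computed only on masked positions."""
    masked, mask = apply_masks(features)
    recon = model(masked.unsqueeze(0)).squeeze(0)
    return ((recon - features)[mask] ** 2).mean()


# Example: one pre-training step on a random 80-dim log-mel "utterance".
model = BidirectionalEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
utterance = torch.randn(300, 80)       # 300 frames, 80 mel bins
optimizer.zero_grad()
loss = masked_reconstruction_loss(model, utterance)
loss.backward()
optimizer.step()
```

Because the encoder is bidirectional and operates on full utterances, the same network can be fine-tuned directly inside a standard bidirectional speech recognizer after pre-training, which is the usage pattern the abstract describes.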