HuBERT-AGG: Aggregated Representation Distillation of Hidden-unit BERT for Robust Speech Recognition
Wei Wang (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University)
Self-supervised learning (SSL) has attracted widespread research interest, as successful SSL approaches such as wav2vec 2.0 and Hidden-unit BERT (HuBERT) have achieved promising results on speech tasks such as automatic speech recognition (ASR). However, little work has been done to improve the noise robustness of SSL models. In this paper, we propose HuBERT-AGG, a novel method that learns noise-invariant SSL representations for robust speech recognition by distilling aggregated layer-wise representations. Specifically, we learn an aggregator that computes a weighted sum of all hidden states of a pretrained vanilla HuBERT by fine-tuning it on a small portion of labeled data. A noise-robust HuBERT is then trained on simulated noisy speech by distilling both the aggregated representations and the layer-wise hidden states produced by the pretrained vanilla HuBERT with the parallel original speech as input. Experiments on LibriSpeech simulated noisy test sets show a 13.1%-17.0% relative word error rate (WER) reduction, with only a very slight degradation on the original test sets. On the CHiME-4 1-channel real-speech test sets, our method surpasses the best results achieved by all published fully supervised ASR models, as well as other SSL approaches that adopt the same data usage as ours.
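The abstract does not give implementation details, but a minimal PyTorch sketch of the two ingredients it describes, a learnable weighted sum over layer-wise hidden states and a distillation objective combining layer-wise and aggregated terms, might look like the following. The class and function names, the choice of L1 distance, and the `alpha` weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerAggregator(nn.Module):
    """Learnable weighted sum over the hidden states of all transformer layers.

    In the paper, these weights are learned by fine-tuning the pretrained
    vanilla HuBERT on a small amount of labeled data, then kept fixed.
    """

    def __init__(self, num_layers: int):
        super().__init__()
        # One scalar weight per layer, normalized with a softmax in forward().
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch, time, dim) tensors, one per layer.
        stacked = torch.stack(hidden_states, dim=0)      # (layers, B, T, D)
        weights = F.softmax(self.layer_weights, dim=0)   # (layers,)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)


def distillation_loss(student_hidden, teacher_hidden, aggregator, alpha=1.0):
    """Hypothetical combined objective: layer-wise L1 plus aggregated L1.

    student_hidden: per-layer hidden states of the noise-robust student,
        computed on simulated noisy speech.
    teacher_hidden: per-layer hidden states of the frozen vanilla HuBERT,
        computed on the parallel clean speech (gradients blocked).
    """
    # Layer-wise distillation across all transformer layers.
    layerwise = sum(
        F.l1_loss(s, t.detach()) for s, t in zip(student_hidden, teacher_hidden)
    ) / len(student_hidden)

    # Distillation of the aggregated (weighted-sum) representation.
    agg_student = aggregator(student_hidden)
    agg_teacher = aggregator([t.detach() for t in teacher_hidden])
    aggregated = F.l1_loss(agg_student, agg_teacher.detach())

    return layerwise + alpha * aggregated
```

Detaching the teacher tensors keeps the vanilla HuBERT frozen so that only the student is pushed toward noise-invariant representations; the relative weight `alpha` between the two terms is a tunable hyperparameter assumed here for illustration.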