DE’HUBERT: DISENTANGLING NOISE IN A SELF-SUPERVISED MODEL FOR ROBUST SPEECH RECOGNITION
Dianwen Ng (Alibaba Group/Nanyang Technological University); Ruixi Zhang (National University of Singapore); Jia Qi Yip (Alibaba Group); Zhao Yang (Xi'an Jiaotong University); Jinjie Ni (Nanyang Technological University); Chong Zhang (Alibaba Group); Yukun Ma (Alibaba Group); Chongjia Ni (Alibaba); Eng Siong Chng (Nanyang Technological University); Bin Ma (Alibaba, Singapore R&D Center)
Self-supervised pre-trained speech models offer an effective way to leverage massive unannotated corpora for building automatic speech recognition (ASR) systems. However, many current models are trained on a clean corpus from a single source and tend to perform poorly when noise is present at test time. Overcoming the adverse influence of noise is crucial for real-world applications. In this work, we propose a novel training framework, called deHuBERT, for noise reduction encoding inspired by H. Barlow’s redundancy-reduction principle. The new framework improves the HuBERT training algorithm by introducing auxiliary losses that drive the self- and cross-correlation matrices between pairwise noise-distorted embeddings towards the identity matrix. This encourages the model to produce noise-agnostic speech representations. With this method, we report improved robustness in noisy environments, including unseen noises, without impairing performance on the clean set.
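The idea of driving correlation matrices towards the identity can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact loss, so the function names, the `off_diag_weight` value, and the assumption that embeddings arrive as `(N, D)` arrays (for example, frames flattened across a batch) are all hypothetical, and the penalty follows the general redundancy-reduction form rather than the paper's specific formulation.

```python
import numpy as np

def correlation_matrix(za, zb, eps=1e-8):
    """Correlation between the feature dimensions of two (N, D) embedding batches."""
    # Standardize each feature dimension over the batch axis.
    za = (za - za.mean(axis=0)) / (za.std(axis=0) + eps)
    zb = (zb - zb.mean(axis=0)) / (zb.std(axis=0) + eps)
    return za.T @ zb / za.shape[0]

def identity_penalty(c, off_diag_weight=5e-3):
    """Squared distance of a correlation matrix from the identity.

    off_diag_weight is an assumed hyperparameter, not a value from the paper.
    """
    diag = np.diag(c)
    on_diag = np.sum((diag - 1.0) ** 2)           # pull diagonal towards 1
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # push off-diagonal towards 0
    return on_diag + off_diag_weight * off_diag

def redundancy_reduction_loss(za, zb, off_diag_weight=5e-3):
    """Sum of cross- and self-correlation penalties for two noise-distorted views."""
    cross = identity_penalty(correlation_matrix(za, zb), off_diag_weight)
    self_a = identity_penalty(correlation_matrix(za, za), off_diag_weight)
    self_b = identity_penalty(correlation_matrix(zb, zb), off_diag_weight)
    return cross + self_a + self_b
```

Here `za` and `zb` stand for embeddings of the same utterance under two different noise distortions. The cross-correlation term rewards embeddings that agree across distortions, while the self-correlation terms discourage redundant feature dimensions. In actual training this would be written in a differentiable framework and added to the HuBERT objective as an auxiliary term.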