
AN ADAPTER BASED MULTI-LABEL PRE-TRAINING FOR SPEECH SEPARATION AND ENHANCEMENT

Tianrui Wang (Beijing Jiaotong University); Xie Chen (Shanghai Jiaotong University); Zhuo Chen (Microsoft); Shu Yu (SJTU); Weibin Zhu (Beijing Jiaotong University, China)

07 Jun 2023

In recent years, self-supervised learning (SSL) has achieved tremendous success in various speech tasks due to its power to extract representations from massive unlabeled data. However, compared with tasks such as speech recognition (ASR), the performance improvements from SSL representations in speech separation (SS) and speech enhancement (SE) are considerably smaller. Based on HuBERT, this work investigates improving the SSL model for SS and SE. We first integrate the separation and denoising steps into the masked speech prediction (MSP) loss, resulting in a multiple pseudo-label pre-training scheme, which significantly improves HuBERT's performance on SS and SE but degrades its performance on ASR. To maintain the performance gain on ASR, we further propose an adapter-based architecture for HuBERT's Transformer encoder, in which only a few parameters of each layer are tuned to the multiple pseudo-label MSP objective while the remaining parameters stay frozen as in default HuBERT. Experimental results show that the proposed adapter-based multiple pseudo-label HuBERT yields consistent and significant performance improvements on all three tasks, with faster pre-training and only a marginal increase in parameters.
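To make the adapter idea concrete, here is a minimal PyTorch sketch of a bottleneck adapter attached to a frozen Transformer encoder layer. The module names, bottleneck width, and insertion point are illustrative assumptions, not the paper's exact implementation; only the idea of updating a small adapter while keeping the pre-trained layer frozen follows the abstract.

```python
# Sketch: trainable bottleneck adapter on top of a frozen encoder layer.
# Dimensions and placement are assumptions for illustration only.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedEncoderLayer(nn.Module):
    """Wraps a pre-trained encoder layer (frozen) and adds a trainable adapter."""

    def __init__(self, frozen_layer: nn.Module, dim: int):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():   # keep the default pre-trained weights fixed
            p.requires_grad = False
        self.adapter = Adapter(dim)         # only these parameters are updated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.layer(x))


if __name__ == "__main__":
    dim = 768                               # HuBERT-Base hidden size
    base = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
    layer = AdaptedEncoderLayer(base, dim)
    x = torch.randn(2, 100, dim)            # (batch, frames, features)
    print(layer(x).shape)                   # torch.Size([2, 100, 768])
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"trainable adapter parameters: {trainable}")
```

In this sketch only the adapter's down/up projections are trainable, which is why the parameter overhead stays marginal relative to the frozen backbone.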
