Data2vec-SG: Improving Self-supervised Learning Representations for Speech Generation Tasks
Heming Wang (The Ohio State University); Yao Qian (Microsoft); Hemin Yang (Microsoft); Naoyuki Kanda (Microsoft); Peidong Wang (Microsoft); Takuya Yoshioka (Microsoft); Xiaofei Wang (Microsoft); Yiming Wang (Microsoft); Shujie Liu (Microsoft Research Asia); Zhuo Chen (Microsoft); DeLiang Wang (The Ohio State University); Michael Zeng (Microsoft)
Self-supervised learning has been successfully applied to a variety of speech recognition and understanding tasks. However, for generative tasks such as speech enhancement and speech separation, most self-supervised speech representations have not shown substantial improvements. To address this problem, we propose data2vec-SG (Speech Generation), a teacher-student learning framework for speech generation tasks. Our data2vec-SG introduces a reconstruction module into data2vec and encourages the learned representations to capture not only semantic information but also the acoustic knowledge needed to generate clean speech waveforms. Experiments demonstrate that the proposed framework boosts the performance of various speech generation tasks, including speech enhancement, speech separation, and packet loss concealment. Moreover, the learned representations also benefit other downstream tasks, as demonstrated by strong speech recognition performance in both clean and noisy conditions.
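For intuition, the following is a minimal PyTorch sketch of how a data2vec-style teacher-student objective with an added reconstruction module could be wired up. It is not the authors' implementation: the class and method names (Data2vecSG, update_teacher), the clean/degraded input pairing, and the choice of MSE and L1 losses are illustrative assumptions, and the sketch omits data2vec details such as time-step masking and averaging of top-k teacher layers as targets.

```python
# Hypothetical sketch of a data2vec-SG-style objective: a data2vec
# teacher-student representation loss plus a reconstruction loss that
# ties the representation to clean-waveform generation. All names and
# loss choices are assumptions for illustration only.

import copy
import torch
import torch.nn as nn

class Data2vecSG(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module,
                 ema_decay: float = 0.999):
        super().__init__()
        self.student = encoder                 # updated by gradient descent
        self.teacher = copy.deepcopy(encoder)  # updated only via EMA
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.decoder = decoder                 # reconstruction module (the proposed addition)
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_teacher(self):
        # Exponential moving average of student weights, as in data2vec.
        for t, s in zip(self.teacher.parameters(), self.student.parameters()):
            t.mul_(self.ema_decay).add_(s, alpha=1.0 - self.ema_decay)

    def forward(self, degraded: torch.Tensor, clean: torch.Tensor):
        # Assumption: the student encodes the degraded (noisy/mixed) signal
        # while the teacher encodes the clean signal to produce targets.
        student_repr = self.student(degraded)
        with torch.no_grad():
            target_repr = self.teacher(clean)
        # 1) data2vec-style representation regression loss.
        repr_loss = nn.functional.mse_loss(student_repr, target_repr)
        # 2) reconstruction loss forcing the representation to retain the
        #    acoustic detail needed to regenerate the clean waveform.
        recon = self.decoder(student_repr)
        recon_loss = nn.functional.l1_loss(recon, clean)
        return repr_loss + recon_loss
```

Under this reading, the reconstruction branch is what distinguishes data2vec-SG from plain data2vec: the representation loss alone preserves mostly semantic content, while the decoder penalty keeps the fine-grained acoustic information that generation tasks such as enhancement and separation require.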