Data2vec-SG: Improving Self-supervised Learning Representations for Speech Generation Tasks

Heming Wang (The Ohio State University); Yao Qian (Microsoft); Hemin Yang (Microsoft); Naoyuki Kanda (Microsoft); Peidong Wang (Microsoft); Takuya Yoshioka (Microsoft); Xiaofei Wang (Microsoft); Yiming Wang (Microsoft Corporation); Shujie Liu (Microsoft Research Asia); Zhuo Chen (Microsoft); DeLiang Wang (Ohio State University); Michael Zeng (Microsoft)

06 Jun 2023

Self-supervised learning has been successfully applied to various speech recognition and understanding tasks. However, for generative tasks such as speech enhancement and speech separation, most self-supervised speech representations have not shown substantial improvements. To address this problem, we propose data2vec-SG (Speech Generation), a teacher-student learning framework designed for speech generation tasks. Data2vec-SG introduces a reconstruction module into data2vec and forces the representations to contain not only semantic information but also the acoustic knowledge needed to generate clean speech waveforms. Experiments demonstrate that the proposed framework improves performance on various speech generation tasks, including speech enhancement, speech separation, and packet loss concealment. The learned representation also benefits other downstream tasks, as demonstrated by its good performance on speech recognition in both clean and noisy conditions.
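
As a rough illustration of the idea in the abstract, the sketch below combines a teacher-student prediction loss with a waveform reconstruction loss. It is not the authors' implementation. The abstract does not specify the loss functions, the weighting, or the network architecture, so the mean squared error, the `recon_weight` parameter, and every function name here are assumptions made for illustration. The exponential-moving-average teacher update follows the general data2vec recipe.

```python
from typing import List

Vector = List[float]


def ema_update(teacher: Vector, student: Vector, decay: float) -> Vector:
    """Move each teacher weight toward the student weight (exponential moving average)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def combined_loss(
    student_repr: Vector,
    teacher_repr: Vector,
    reconstructed_wave: Vector,
    clean_wave: Vector,
    recon_weight: float = 1.0,  # hypothetical weighting, not taken from the paper
) -> float:
    """Sum of a representation-prediction term and a waveform-reconstruction term.

    Assumption: both terms use MSE here purely for simplicity.
    """
    prediction_loss = mse(student_repr, teacher_repr)
    reconstruction_loss = mse(reconstructed_wave, clean_wave)
    return prediction_loss + recon_weight * reconstruction_loss
```

In a training loop, one would typically compute `combined_loss` on each batch, update the student by gradient descent, and then call `ema_update` to refresh the teacher. The reconstruction term is what pushes the representation to retain the acoustic detail needed to regenerate clean speech.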

Tags:

Robust speech recognition and adaptation

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
