Efficient Personalized Speech Enhancement Through Self-Supervised Learning
Aswin Sivaraman (Indiana University Bloomington); Minje Kim (Indiana University)
This work presents self-supervised learning methods for monaural speaker-specific (i.e., personalized) speech enhancement models. While general-purpose models must broadly address many speakers, personalized models can adapt to a particular speaker's voice, as they solve a narrower problem. Hence, personalization can achieve better enhancement performance in addition to reducing computational complexity. However, naive personalization methods require clean speech from the target user, which can be inconvenient to collect, e.g., due to subpar recording conditions. Instead, we pose personalization as either a zero-shot task, in which no clean speech of the target speaker is used, or a few-shot learning task, in which the goal is to minimize the duration of clean speech used for transfer learning. In this paper, we propose self-supervised learning methods as a solution to both zero- and few-shot personalization tasks. The proposed methods learn personalized speech features from unlabeled data (i.e., in-the-wild noisy recordings from the target user) rather than from clean sources. We investigate three different self-supervised learning mechanisms. First, we set up a pseudo speech enhancement problem as a pretext task, which pretrains the models to estimate the noisy speech as if it were the clean target. The other two mechanisms, contrastive learning and data purification, regularize the loss function of the pseudo enhancement problem, overcoming the limitations of learning from unlabeled data. We assess our methods by personalizing the well-known ConvTasNet architecture to twenty different target speakers. The results show that self-supervision-based personalization improves on the original ConvTasNet's enhancement quality while using fewer model parameters and less clean data from the target user.
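To make the pseudo speech enhancement pretext task more concrete, the following is a minimal sketch of its training objective, assuming a PyTorch-style enhancement model. The function and variable names (pseudo_se_pretext_loss, extra_noise) are illustrative only, and the mean-squared-error loss stands in for whatever enhancement loss the actual system uses.

import torch
import torch.nn.functional as F

def pseudo_se_pretext_loss(model, noisy_speech, extra_noise):
    """Pretext loss for pseudo speech enhancement (sketch).

    The in-the-wild noisy recording of the target user is treated as if
    it were the clean target: it is further contaminated with additional
    noise, and the model is trained to recover the original recording.
    """
    # Unlabeled noisy recording acts as the pseudo-clean target
    pseudo_target = noisy_speech
    # Pretext input: the recording contaminated with extra noise
    pretext_input = noisy_speech + extra_noise
    # Model estimates the pseudo target from the doubly-noisy mixture
    estimate = model(pretext_input)
    # Placeholder regression loss; any enhancement loss could be used here
    return F.mse_loss(estimate, pseudo_target)

In a few-shot setting, a model pretrained with this objective on the target user's unlabeled recordings would then be fine-tuned on the small amount of available clean speech; in the zero-shot setting, no such fine-tuning step is assumed.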