Self supervised representation learning with deep clustering for acoustic unit discovery from raw speech

Varun Krishna, Sriram Ganapathy

Length: 00:12:14

08 May 2022

The automatic discovery of acoustic sub-word units from raw speech, without any text or labels, is a growing field of research. The key challenge is to derive speech representations that can be categorized into a small number of phoneme-like units that are speaker-invariant and broadly capture the content variability of speech. In this work, we propose a novel neural network paradigm that combines a deep clustering loss with the autoregressive contrastive predictive coding (CPC) loss. Both loss functions are self-supervised. The clustering loss is computed against phoneme-like labels generated by an iterative k-means algorithm. Including this loss ensures that the model representations can be categorized into a small number of automatically discovered speech units. We experiment with several sub-tasks of the Zerospeech 2021 challenge to illustrate the effectiveness of the framework. In these experiments, we show that the proposed representation learning approach improves significantly over previous self-supervised models, as well as the wav2vec family of models, on a range of word-level similarity and language modeling tasks.
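The way the two self-supervised losses might be combined can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`kmeans_labels`, `info_nce`, `cluster_loss`, `combined_loss`), the weighting parameter `weight`, the use of an InfoNCE-style contrastive term with in-batch negatives, and the cross-entropy form of the clustering term are all assumptions introduced here for illustration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def kmeans_labels(feats, k, iters=10, seed=0):
    # Plain Lloyd's k-means; returns one pseudo-label per frame.
    # (Hypothetical helper: the abstract only says "iterative k-means".)
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = feats[labels == j]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    dists = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

def info_nce(pred, target):
    # Contrastive term: row t of `pred` should match row t of `target`;
    # all other rows of `target` act as negatives (an assumption here).
    scores = pred @ target.T
    return float(-np.mean(np.diag(log_softmax(scores, axis=1))))

def cluster_loss(logits, labels):
    # Cross-entropy between cluster logits and k-means pseudo-labels.
    logp = log_softmax(logits, axis=1)
    return float(-np.mean(logp[np.arange(len(labels)), labels]))

def combined_loss(pred, target, logits, labels, weight=1.0):
    # Weighted sum of the two self-supervised terms; `weight` is hypothetical.
    return info_nce(pred, target) + weight * cluster_loss(logits, labels)

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
latents = rng.normal(size=(50, 8))      # frame-level representations
predictions = rng.normal(size=(50, 8))  # autoregressive predictions
cluster_logits = rng.normal(size=(50, 4))
pseudo_labels = kmeans_labels(latents, k=4)
total = combined_loss(predictions, latents, cluster_logits, pseudo_labels)
```

In a training loop along these lines, the pseudo-labels would be regenerated periodically from the current representations, so that clustering and representation learning refine each other; how often that happens is a design choice this sketch does not address.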

Tags:

representation learning

self-supervised learning

zerospeech challenge

deep clustering

contrastive predictive coding

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
