SLICER: Learning universal audio representations using low-resource self-supervised pre-training

Ashish Seth (IIT Madras); Sreyan Ghosh (University of Maryland, College Park); S Umesh (IIT Madras); Dinesh Manocha (University of Maryland, College Park)

07 Jun 2023

We present a new Self-Supervised Learning (SSL) approach for pre-training encoders on unlabeled audio data, reducing the need for large amounts of labeled data in audio and speech classification. Our primary aim is to learn audio representations that generalize across a wide variety of speech and non-speech tasks when only a small amount of unlabeled audio is available for pre-training. Inspired by the recent success of clustering and contrastive learning paradigms for SSL-based speech representation learning, we propose SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), which brings together the best of both paradigms. We use a symmetric loss between latent representations from student and teacher encoders and simultaneously solve instance-level and cluster-level contrastive learning tasks. We obtain cluster representations online by simply projecting the input spectrogram into an output subspace whose dimension equals the number of clusters. In addition, we propose a novel mel-spectrogram augmentation procedure, k-mix, which is based on mixup, does not require labels, and aids unsupervised representation learning for audio. Overall, SLICER achieves state-of-the-art results on the LAPE Benchmark, significantly outperforming all prior approaches, some of which were pre-trained on 10× more unlabeled data than in our setting.
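To make the combination of instance-level and cluster-level contrastive objectives concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that student and teacher encoders have already produced one embedding per audio clip, that cluster representations are the columns of a softmax-normalized projection onto K outputs, and that both objectives use an InfoNCE-style cross-entropy applied in both directions. The function names (`info_nce`, `symmetric`, `slicer_style_loss`), the temperature value, the shared projection matrix `w_cluster`, and the equal weighting of the two terms are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def softmax(x):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def info_nce(a, b, temperature=0.1):
    # Contrastive cross-entropy: row i of `a` should match row i of `b`
    # and be dissimilar to every other row of `b`.
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def symmetric(loss_fn, a, b):
    # Average the loss over both directions (a -> b and b -> a).
    return 0.5 * (loss_fn(a, b) + loss_fn(b, a))

def slicer_style_loss(student_emb, teacher_emb, w_cluster):
    # Instance level: rows (one per clip) are contrasted across the two views.
    instance = symmetric(info_nce, student_emb, teacher_emb)
    # Cluster level: project to K outputs; each column is one cluster's
    # soft assignment over the batch, contrasted across the two views.
    p_student = softmax(student_emb @ w_cluster)   # shape (batch, K)
    p_teacher = softmax(teacher_emb @ w_cluster)   # shape (batch, K)
    cluster = symmetric(info_nce, p_student.T, p_teacher.T)
    return instance + cluster  # equal weighting is an assumption

# Toy usage with random stand-ins for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
student = rng.normal(size=(8, 16))                    # 8 clips, 16-dim embeddings
teacher = student + 0.05 * rng.normal(size=(8, 16))   # slightly perturbed view
w = rng.normal(size=(16, 4))                          # K = 4 clusters
print(slicer_style_loss(student, teacher, w))
```

In a real training loop the teacher would typically be held fixed or updated slowly (for example as a moving average of the student) and gradients would flow only through the student; that detail is omitted here because the sketch only evaluates the loss.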

Tags: Deep learning techniques

Included in: IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle