Group Masked Model Learning for General Audio Representation
Sara Atito, Muhammed Awais, Tony Alex, Josef Kittler
SPS
Vision transformers have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. However, transformers are known to be data-hungry, requiring orders of magnitude more data to train. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependence on large amounts of labelled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Audio-GMML, a self-supervised transformer for general audio representation based on Group Masked Model Learning (GMML) and a patch aggregation strategy that improves the quality of the learned representations and enforces the global structure of the given audio. We evaluate our pretrained models on several downstream tasks, setting new state-of-the-art performance on five audio and speech classification tasks. The code and pretrained weights will be made publicly available to the scientific community.
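The core idea of group masking, in contrast to masking independent random patches, is to hide connected groups of spectrogram patches so the model must recover local time-frequency structure from global context. The following is a minimal NumPy sketch of such a masking scheme; the grid dimensions, block size, and masking ratio are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def group_mask(grid_h, grid_w, mask_ratio=0.5, block=3, rng=None):
    """Sketch of group-style masking: mask connected block x block groups
    of patches on a (grid_h, grid_w) spectrogram patch grid until at
    least mask_ratio of the patches are hidden.
    All parameter values here are illustrative, not the paper's."""
    rng = np.random.default_rng(rng)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(mask_ratio * grid_h * grid_w)
    while mask.sum() < target:
        # pick a random top-left corner and mask the group around it
        i = rng.integers(0, grid_h)
        j = rng.integers(0, grid_w)
        mask[i:i + block, j:j + block] = True
    return mask

# e.g. a 16 x 8 patch grid with half of the patches masked in groups
mask = group_mask(16, 8, mask_ratio=0.5, block=3, rng=0)
print(mask.shape, mask.mean())
```

Because whole blocks are hidden at once, neighbouring patches cannot be trivially interpolated from their immediate surroundings, which is what pushes the encoder toward the global structure the abstract refers to.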