Group Masked Model Learning for General Audio Representation
Sara Atito, Muhammed Awais, Tony Alex, Josef Kittler
SPS
Vision transformers have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. However, transformers are known to be data-hungry, requiring orders of magnitude more data to train. This has motivated research into self-supervised pretraining of audio transformers, which reduces the dependence on large amounts of labelled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Audio-GMML, a self-supervised transformer for general audio representation based on Group Masked Model Learning (GMML) and a patch aggregation strategy that improves the quality of the learned representations and enforces the global structure of the given audio. We evaluate our pretrained models on several downstream tasks, setting new state-of-the-art performance on five audio and speech classification tasks. The code and pretrained weights will be made publicly available to the scientific community.
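The core idea of group masking, in contrast to masking independent random patches, is to hide connected groups of spectrogram patches so the model must recover local time-frequency structure from global context. The following is a minimal NumPy sketch of such a masking scheme; the grid dimensions, block size, and masking ratio are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def group_mask(grid_h, grid_w, mask_ratio=0.5, block=3, rng=None):
    """Sketch of group-style masking: mask connected block x block groups
    of patches on a (grid_h, grid_w) spectrogram patch grid until at
    least mask_ratio of the patches are hidden.
    All parameter values here are illustrative, not the paper's."""
    rng = np.random.default_rng(rng)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(mask_ratio * grid_h * grid_w)
    while mask.sum() < target:
        # pick a random top-left corner and mask the group around it
        i = rng.integers(0, grid_h)
        j = rng.integers(0, grid_w)
        mask[i:i + block, j:j + block] = True
    return mask

# e.g. a 16 x 8 patch grid with half of the patches masked in groups
mask = group_mask(16, 8, mask_ratio=0.5, block=3, rng=0)
print(mask.shape, mask.mean())
```

Because whole blocks are hidden at once, neighbouring patches cannot be trivially interpolated from their immediate surroundings, which is what pushes the encoder toward the global structure the abstract refers to.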