Group Masked Model Learning for General Audio Representation

Sara Atito, Muhammed Awais, Tony Alex, Josef Kittler

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
Poster 10 Oct 2023

Vision transformers have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. However, transformers are known to be data hungry, requiring orders of magnitude more data to train. This has motivated research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose Audio-GMML, a self-supervised transformer for general audio representations that is based on Group Masked Model Learning (GMML) and a patch aggregation strategy to improve the quality of the learned representations and enforce the global structure of the given audio. We evaluate our pretrained models on several downstream tasks, setting a new state-of-the-art performance on five audio and speech classification tasks. The code and pretrained weights will be made publicly available for the scientific community.
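The core idea of group-masked modeling, as opposed to masking individual patches independently, is to mask connected groups of neighbouring spectrogram patches so that the model must reconstruct local time-frequency structure from surrounding context. The sketch below illustrates this with a toy NumPy implementation; the function name, block-based grouping, and all parameter choices are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def group_mask(grid_h, grid_w, mask_ratio=0.6, group_size=4, seed=None):
    """Illustrative group-wise patch masking (assumption, not the paper's
    exact algorithm): repeatedly mask square blocks of neighbouring
    patches until roughly `mask_ratio` of the patch grid is covered.

    Returns a boolean array of shape (grid_h, grid_w); True = masked.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    target = int(mask_ratio * grid_h * grid_w)
    while mask.sum() < target:
        # Pick the top-left corner of a group of neighbouring patches.
        i = rng.integers(0, max(1, grid_h - group_size + 1))
        j = rng.integers(0, max(1, grid_w - group_size + 1))
        mask[i:i + group_size, j:j + group_size] = True
    return mask

# Example: an 8x64 patch grid (e.g. a mel-spectrogram split into 16x16
# patches), masking about 60% of patches in connected groups.
m = group_mask(8, 64, mask_ratio=0.6, group_size=4, seed=0)
```

During pretraining, the masked patches would be replaced (e.g. by a learnable token or noise) and the transformer trained to recover them, encouraging representations that capture the global structure of the audio.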
