AVES: Animal Vocalization Encoder based on Self-Supervision

Masato Hagiwara (Earth Species Project)

06 Jun 2023

The lack of annotated training data in bioacoustics hinders the use of large-scale neural network models trained in a supervised way. To leverage large amounts of unannotated audio data, we propose AVES (Animal Vocalization Encoder based on Self-Supervision), a self-supervised, transformer-based audio representation model for encoding animal vocalizations. We pretrain AVES on a diverse set of unannotated audio datasets and fine-tune it for downstream bioacoustics tasks. Comprehensive experiments on a suite of classification and detection tasks show that AVES outperforms all the strong baselines and even the supervised "topline" models trained on annotated audio classification datasets. The results also suggest that curating a small training subset related to downstream tasks is an efficient way to train high-quality audio representation models. We open-source our models.

Tags:

Detection and classification of acoustic scenes and events

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
