
AUDIOCLIP: EXTENDING CLIP TO IMAGE, TEXT AND AUDIO

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:05:35
13 May 2022

The rapidly evolving field of sound classification has greatly benefited from the methods of other domains. Today, the trend is to fuse domain-specific tasks and approaches, which provides the community with new outstanding models. We present AudioCLIP, an extension of the CLIP model that handles audio in addition to text and images. Utilizing the AudioSet dataset, our proposed model incorporates the ESResNeXt audio model into the CLIP framework, thus enabling it to perform multimodal classification while keeping CLIP's zero-shot capabilities. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task and outperforms other approaches, reaching accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K. Further, it sets new baselines in the zero-shot ESC task on the same datasets (69.40% and 68.78%, respectively). We also assess the influence of different training setups on the final performance of the proposed model. For the sake of reproducibility, our code is published.
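
To illustrate what zero-shot classification means in a CLIP-style model, the following is a minimal sketch, not the authors' published code. It assumes that an audio clip and each candidate class label have already been mapped into a shared embedding space by some encoders. The embedding vectors, the function names, and the similarity scale of 100 are hypothetical choices made only for this illustration. The label whose text embedding has the highest cosine similarity with the audio embedding is taken as the prediction, so no classifier has to be trained on the target classes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(audio_embedding, label_embeddings, scale=100.0):
    """Pick the label whose embedding is closest to the audio embedding.

    `scale` is an assumed temperature-like factor applied to the
    similarities before the softmax; it is not taken from the paper.
    """
    labels = list(label_embeddings)
    sims = [cosine_similarity(audio_embedding, label_embeddings[name])
            for name in labels]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Hypothetical toy embeddings; a real system would obtain these from
# trained text and audio encoders.
label_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "siren": [0.0, 0.8, 0.6],
    "rain": [0.1, 0.0, 0.95],
}
audio_embedding = [0.8, 0.2, 0.1]

predicted, probabilities = zero_shot_classify(audio_embedding, label_embeddings)
print(predicted)
```

Because the candidate classes enter only as embedded label text, the label set can be swapped at inference time without retraining, which is the property the abstract refers to as zero-shot capability.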
