AUDIOCLIP: EXTENDING CLIP TO IMAGE, TEXT AND AUDIO
Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel
The rapidly evolving field of sound classification has greatly benefited from methods developed in other domains. Today, the trend is to fuse domain-specific tasks and approaches, providing the community with new, outstanding models. We present AudioCLIP, an extension of the CLIP model that handles audio in addition to text and images. Utilizing the AudioSet dataset, our proposed model incorporates the ESResNeXt audio model into the CLIP framework, enabling it to perform multimodal classification while keeping CLIP's zero-shot capabilities. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K. Further, it sets new baselines in the zero-shot ESC task on the same datasets (69.40% and 68.78%, respectively). We also assess the influence of different training setups on the final performance of the proposed model. For the sake of reproducibility, our code is published.
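To make the setup described in the abstract more concrete, the sketch below illustrates a CLIP-style contrastive objective extended to three modalities, in the spirit of AudioCLIP: image, text, and audio embeddings are projected into a shared space, and every modality pair is trained with a symmetric cross-entropy loss. The encoders, projection dimensions, and initial temperature here are illustrative placeholders, not the authors' implementation (AudioCLIP itself combines CLIP's image and text encoders with ESResNeXt as the audio head).

```python
# Hypothetical sketch of a tri-modal contrastive objective (not the official
# AudioCLIP code). Placeholder linear encoders stand in for the CLIP image
# and text encoders and the ESResNeXt audio encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F


def pair_loss(a: torch.Tensor, b: torch.Tensor, logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy between two batches of L2-normalized embeddings."""
    logits = logit_scale * a @ b.t()                   # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


class TriModalContrastive(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Placeholder projections; input dimensions are assumptions.
        self.image_proj = nn.Linear(1024, embed_dim)
        self.text_proj = nn.Linear(768, embed_dim)
        self.audio_proj = nn.Linear(2048, embed_dim)
        # Learnable temperature, as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))

    def forward(self, image_feat, text_feat, audio_feat):
        img = F.normalize(self.image_proj(image_feat), dim=-1)
        txt = F.normalize(self.text_proj(text_feat), dim=-1)
        aud = F.normalize(self.audio_proj(audio_feat), dim=-1)
        scale = self.logit_scale.exp()
        # Average the contrastive loss over all three modality pairs.
        return (pair_loss(img, txt, scale) +
                pair_loss(aud, txt, scale) +
                pair_loss(aud, img, scale)) / 3.0


if __name__ == "__main__":
    model = TriModalContrastive()
    batch = 8
    loss = model(torch.randn(batch, 1024),
                 torch.randn(batch, 768),
                 torch.randn(batch, 2048))
    print(f"tri-modal contrastive loss: {loss.item():.4f}")
```

Because all three modalities share one embedding space, zero-shot classification reduces to comparing an audio (or image) embedding against text embeddings of the class names, exactly as in the original CLIP inference procedure.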