CLAP: Learning Audio Concepts From Natural Language Supervision

Benjamin Elizalde (Microsoft); Soham Deshmukh (Microsoft); Mahmoud Al Ismail (Microsoft); Huaming Wang (Microsoft)

07 Jun 2023

Mainstream machine listening models are trained to learn audio concepts under a paradigm that maps one class label to many recordings for a single task. Learning under such restricted supervision limits the flexibility of models: they require class-labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP); it connects language and audio using two encoders and a contrastive learning objective, bringing audio and text descriptions into a joint multimodal space. We trained CLAP on 128k audio-text pairs and evaluated it on 16 downstream tasks across 7 domains, such as classification of sound events, scenes, music, and speech. CLAP establishes state-of-the-art (SoTA) Zero-Shot performance. We also evaluated CLAP's audio encoder in a supervised learning setup and achieved SoTA on 5 tasks. The Zero-Shot capability removes the need for training with class-labeled audio, enables flexible class prediction at inference time, and generalizes well across multiple downstream tasks.
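To make the two-encoder design concrete, below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive objective over paired audio and text embeddings, and of zero-shot classification by comparing an audio embedding against prompted class-name embeddings. Function names, the `text_encoder` interface, the prompt template, and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
# A rough sketch of a contrastive language-audio objective and zero-shot
# inference, under assumed interfaces; not the authors' released code.
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over audio<->text similarities for a batch
    of N paired examples. audio_emb, text_emb: (N, d) projections into
    the joint multimodal space."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs lie on the diagonal; all other pairs are negatives.
    loss_audio = F.cross_entropy(logits, targets)      # audio -> text
    loss_text = F.cross_entropy(logits.t(), targets)   # text -> audio
    return (loss_audio + loss_text) / 2

def zero_shot_classify(audio_emb, class_names, text_encoder):
    """Zero-shot inference: embed one prompt per class name and pick the
    most similar class; no class-labeled audio is used for training.
    `text_encoder` is a hypothetical callable mapping a list of strings
    to (C, d) embeddings in the same joint space."""
    prompts = [f"This is a sound of {c}" for c in class_names]
    text_emb = F.normalize(text_encoder(prompts), dim=-1)  # (C, d)
    audio_emb = F.normalize(audio_emb, dim=-1)             # (N, d)
    return (audio_emb @ text_emb.t()).argmax(dim=-1)       # class indices
```

Because the class prompts are ordinary text, the set of predictable categories can be changed at inference time simply by passing different `class_names`, which is what gives the approach its flexibility relative to fixed-label classifiers.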
