Audio-Text Models Do Not Yet Leverage Natural Language

Ho-Hsiang Wu (New York University); Oriol Nieto (Pandora); Juan P Bello (New York University); Justin Salamon (Adobe Research)

06 Jun 2023

Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated on standard audio retrieval and classification benchmarks, under the assumptions that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In this work, we show that state-of-the-art audio-text models do not yet truly understand natural language, especially contextual concepts such as the sequential or concurrent ordering of sound events. Our results suggest that existing benchmarks are not sufficient to assess these models' ability to match complex contexts across the audio and text modalities. We propose a Transformer-based architecture and show that, unlike prior work, it is capable of modeling the sequential relationship between sound events in the text and audio, given appropriate benchmark data. We advocate for the collection or generation of additional, diverse data to allow future research to fully leverage natural language for audio-text modeling.
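To make the ordering issue concrete, the following is a minimal toy sketch, not the authors' method or architecture. It compares two hypothetical text encoders, a bag-of-words encoder that ignores word order and a position-weighted encoder that does not, on a pair of captions whose sound events are swapped. The function names, the captions, and both encoders are assumptions introduced purely for illustration; no real audio-text model is involved.

```python
import math
from collections import Counter


def bag_of_words(caption):
    """Order-invariant toy encoder: keeps word counts only."""
    return Counter(caption.lower().split())


def position_weighted(caption):
    """Order-sensitive toy encoder: weights each word by its 1-based position."""
    vec = Counter()
    for position, word in enumerate(caption.lower().split(), start=1):
        vec[word] += position
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


# Hypothetical captions describing the same two events in opposite order.
first = "a dog barks followed by a door slams"
swapped = "a door slams followed by a dog barks"

bow_sim = cosine(bag_of_words(first), bag_of_words(swapped))
pos_sim = cosine(position_weighted(first), position_weighted(swapped))

print(f"bag-of-words similarity:      {bow_sim:.3f}")
print(f"position-weighted similarity: {pos_sim:.3f}")
```

An encoder that behaves like the bag-of-words case assigns both captions the same representation, so no retrieval benchmark built on it can tell "A then B" from "B then A". A probe of this kind, swapping event order in the text and checking whether the score changes, is one simple way to test whether a model uses ordering information at all.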

Tags:

Audio for multimedia and audio processing systems

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
