AUDIO-TEXT RETRIEVAL IN CONTEXT

Siyu Lou, Xuenan Xu, Mengyue Wu, Kai Yu

Length: 00:15:13

11 May 2022

Audio-text retrieval based on natural language descriptions is a challenging task. It involves learning cross-modality alignments between long sequences under inadequate data conditions. In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. Moreover, through a qualitative analysis we observe that semantic mapping is more important than temporal relations in contextual retrieval. Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system. Specifically, we utilize PANNs features pre-trained on a large sound event dataset and NetRVLAD pooling, which directly works with averaged descriptors. Experiments are conducted on the AudioCaps and CLOTHO datasets, and results are compared with the previous state-of-the-art system. With our proposed system, a significant improvement has been achieved on bidirectional audio-text retrieval, on all metrics including recall, median and mean rank.
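The metrics named in the abstract (recall, median rank and mean rank) can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes a square similarity matrix in which query i has exactly one correct candidate, at index i, which is a simplification, since a dataset may pair several captions with one audio clip. The function name retrieval_metrics and the toy scores are hypothetical.

```python
import numpy as np


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@k, median rank and mean rank for one retrieval direction.

    Assumption (hypothetical setup): sim[i, j] is the similarity between
    query i and candidate j, and the only correct candidate for query i
    is candidate i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the correct candidate for each query, as a column vector.
    correct = np.diag(sim)[:, None]
    # Rank = 1 + number of candidates scoring strictly higher than the correct one.
    ranks = 1 + (sim > correct).sum(axis=1)
    metrics = {f"R@{k}": float((ranks <= k).mean()) for k in ks}
    metrics["median_rank"] = float(np.median(ranks))
    metrics["mean_rank"] = float(ranks.mean())
    return metrics


# Toy scores: rows are text queries, columns are audio clips.
toy_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
])

text_to_audio = retrieval_metrics(toy_sim, ks=(1, 2))
# Transposing swaps the roles of query and candidate (audio-to-text).
audio_to_text = retrieval_metrics(toy_sim.T, ks=(1, 2))
print(text_to_audio)
print(audio_to_text)
```

Higher recall is better, while lower median and mean rank are better. Evaluating both the matrix and its transpose gives the two directions of bidirectional retrieval.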

Tags: cross-modal, pre-trained model, audio-text retrieval, aggregation

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
