Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss

Yifei Xin (Peking University); Dongchao Yang (Peking University); Yuexian Zou (Peking University)



06 Jun 2023

In text-audio retrieval (TAR), text and audio are heterogeneous in content: the semantic information in a text typically matches only certain frames of the audio. Yet existing works aggregate the entire audio without considering the text, for example by mean-pooling over the frames, which is likely to encode misleading audio information not described in the given text. In this paper, we present a text-aware attention pooling (TAP) module for TAR, which is essentially a scaled dot-product attention that lets a text attend to its most semantically similar frames. Furthermore, previous methods apply the softmax only within each single retrieval direction, ignoring potential cross-retrieval information. By exploring the intrinsic prior of each text-audio pair, we introduce a prior matrix revised (PMR) loss that filters out hard cases with high (or low) text-to-audio but low (or high) audio-to-text similarity scores, thus achieving the dual optimal match. Experiments show that TAP significantly outperforms various text-agnostic pooling functions. Moreover, the PMR loss shows stable performance gains on multiple datasets.
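
The two ideas in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation, and all function names are hypothetical. `text_aware_pool` is a plain scaled dot-product attention in which the text embedding is the query and the audio frames serve as both keys and values, with no learned projections (an assumption). `revised_t2a_loss` is only one possible reading of a "prior matrix revised" loss: the abstract gives no formula, so the sketch assumes the prior is a softmax taken along the opposite retrieval direction and multiplied element-wise into the similarity matrix before the usual cross-entropy.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(frames):
    """Text-agnostic baseline: average all frame embeddings."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def text_aware_pool(text, frames):
    """Scaled dot-product attention with the text embedding as the query.

    Assumption: frames are used directly as keys and values.
    Returns the pooled audio embedding and the per-frame weights.
    """
    scale = math.sqrt(len(text))
    weights = softmax([dot(text, f) / scale for f in frames])
    pooled = [sum(w * f[i] for w, f in zip(weights, frames))
              for i in range(len(text))]
    return pooled, weights


def revised_t2a_loss(sim):
    """Text-to-audio loss on a prior-revised similarity matrix.

    sim[i][j] is the similarity of text i and audio j; matched pairs
    lie on the diagonal. Assumed form (not taken from the abstract):
    the prior for audio j is a softmax over all texts, so a pair must
    score well in both directions to keep a high revised score.
    """
    n = len(sim)
    # Audio-to-text prior: one softmax per column of sim.
    prior = [softmax([sim[i][j] for i in range(n)]) for j in range(n)]
    revised = [[sim[i][j] * prior[j][i] for j in range(n)] for i in range(n)]
    # Standard cross-entropy over each text's row of revised scores.
    return -sum(math.log(softmax(revised[i])[i]) for i in range(n)) / n


# Toy usage: only the first frame matches the text.
text = [1.0, 0.0]
frames = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
pooled, weights = text_aware_pool(text, frames)
baseline = mean_pool(frames)
```

In this toy case the attention weights concentrate on the first frame, so `pooled` stays close to the content the text describes, while `baseline` is dominated by the two unrelated frames.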

Tags: Modeling, analysis and synthesis of acoustic environments


Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle

