Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization

Andrew Koh, Fuzhao Xue, Eng Siong Chng

SPS

Length: 00:14:13

11 May 2022

In this paper, we examine transfer learning with Pretrained Audio Neural Networks (PANNs) and propose an architecture that better leverages the acoustic features PANNs provide for the Automated Audio Captioning task. We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR). The RLSSR module supplements model training by maximizing the similarity between the encoder and decoder embeddings. Combining both methods allows us to surpass state-of-the-art results by a significant margin on the Clotho dataset across several metrics and benchmarks.
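To make the idea of a similarity regularizer between encoder and decoder embeddings concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the similarity measure, how the two embeddings are aligned, or how the term is weighted. The sketch assumes mean pooling over time, cosine similarity, equal embedding dimensions, and a hypothetical weight of 0.1; all function names are illustrative.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mean_pool(frames: Sequence[Vector]) -> list:
    """Average a sequence of equal-length vectors into one vector (assumed pooling)."""
    count = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / count for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def similarity_regularizer(encoder_frames: Sequence[Vector],
                           decoder_states: Sequence[Vector]) -> float:
    """Return 1 - cos(enc, dec); minimizing this maximizes the similarity.

    Assumption: both sequences are pooled to single vectors of the same size.
    """
    enc = mean_pool(encoder_frames)
    dec = mean_pool(decoder_states)
    return 1.0 - cosine_similarity(enc, dec)


def total_loss(caption_loss: float,
               encoder_frames: Sequence[Vector],
               decoder_states: Sequence[Vector],
               weight: float = 0.1) -> float:
    """Add the weighted regularizer to the captioning loss (weight is hypothetical)."""
    return caption_loss + weight * similarity_regularizer(encoder_frames, decoder_states)
```

In a real training setup the same term would be computed on tensors in an autodiff framework so that its gradient reaches both the encoder and the decoder; the plain-Python version only shows the arithmetic of the objective.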

Tags: automated audio captioning, transfer learning, transformers

Value-Added Bundle(s) Including this Product

22 May 2022

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
