Enhanced Embeddings in Zero-Shot Learning for Environmental Audio
Ysobel Sims (The University of Newcastle); Alexandre Mendes (The University of Newcastle); Stephan K Chalup (The University of Newcastle)
Zero-shot learning is a scenario in machine learning where the classes used in the training and test sets are disjoint.
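The disjoint-classes condition can be stated as a minimal sketch; the class names below are illustrative ESC-50-style environmental audio labels, not the paper's actual split.

```python
# Zero-shot setting: the sets of training and test classes share no labels.
train_classes = {"dog", "rain", "sea_waves", "crackling_fire"}
test_classes = {"chainsaw", "church_bells", "helicopter"}

# No test class is ever seen during training.
print(train_classes.isdisjoint(test_classes))  # True
```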
This work considers zero-shot learning for environmental audio and improves results by enhancing both the audio and the word embeddings. Previous work uses the VGGish model for audio embeddings, and textual class labels are commonly fed to word-embedding models such as Word2Vec. This study instead uses a modified YAMNet network to obtain semantic audio embeddings for zero-shot learning. In addition, it augments the word-embedding input with linguistic devices such as synonyms, semantic broadening and onomatopoeia. With these two modifications, top-1 accuracy on ESC-50 increases on average by over five percentage points compared to the state-of-the-art. This emerging area of research has applications in robot awareness, security systems and wildlife conservation in situations where no data is available for some classes.
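The overall idea of matching audio embeddings to enriched class embeddings can be sketched as follows. This is a hedged toy illustration, not the paper's implementation: random vectors stand in for Word2Vec and YAMNet embeddings, and the enriched term lists for each class are invented examples of synonyms and onomatopoeia.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension; real Word2Vec vectors are far larger

# Toy "word embedding" lookup standing in for a trained Word2Vec model.
vocab = {w: rng.normal(size=DIM) for w in
         ["dog", "puppy", "woof", "rain", "drizzle", "patter"]}

def class_embedding(terms):
    """Average the embeddings of a label plus its linguistic enrichments
    (synonyms, semantic broadening, onomatopoeia)."""
    return np.mean([vocab[t] for t in terms], axis=0)

# Hypothetical enriched label descriptions per class.
classes = {
    "dog": ["dog", "puppy", "woof"],
    "rain": ["rain", "drizzle", "patter"],
}
prototypes = {c: class_embedding(ts) for c, ts in classes.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(audio_embedding):
    """Assign the unseen-class label whose prototype is most similar."""
    return max(prototypes, key=lambda c: cosine(audio_embedding, prototypes[c]))

# A fake "audio embedding" placed near the dog prototype for illustration;
# in the paper this vector would come from the modified YAMNet network.
query = prototypes["dog"] + 0.05 * rng.normal(size=DIM)
print(predict(query))  # expected to recover "dog" for this near-prototype query
```

The key design point is that richer textual input to the word-embedding side moves class prototypes toward the acoustic semantics of the sound (e.g. "woof" for a dog bark), which is what the linguistic-device augmentation targets.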