Enhanced Embeddings in Zero-Shot Learning for Environmental Audio
Ysobel Sims (The University of Newcastle); Alexandre Mendes (The University of Newcastle); Stephan K Chalup (The University of Newcastle)
Zero-shot learning is a scenario in machine learning where the classes used in the training and test sets are disjoint.
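The disjoint-classes condition can be stated as a minimal sketch; the class names below are illustrative ESC-50-style environmental audio labels, not the paper's actual split.

```python
# Zero-shot setting: the sets of training and test classes share no labels.
train_classes = {"dog", "rain", "sea_waves", "crackling_fire"}
test_classes = {"chainsaw", "church_bells", "helicopter"}

# No test class is ever seen during training.
print(train_classes.isdisjoint(test_classes))  # True
```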
This work considers zero-shot learning for environmental audio and improves results by enhancing both the audio and the word embeddings. Previous work uses the VGGish model for audio embeddings, and textual class labels are commonly fed to word-embedding models such as Word2Vec. This study instead uses a modified YAMNet network to obtain semantic audio embeddings for zero-shot learning. In addition, it augments the word-embedding input with linguistic devices such as synonyms, semantic broadening and onomatopoeia. With these two modifications, top-1 accuracy on ESC-50 increases on average by over five percentage points compared to the state-of-the-art. This emerging area of research has applications in robot awareness, security systems and wildlife conservation in situations where no data is available for some classes.
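The overall idea of matching audio embeddings to enriched class embeddings can be sketched as follows. This is a hedged toy illustration, not the paper's implementation: random vectors stand in for Word2Vec and YAMNet embeddings, and the enriched term lists for each class are invented examples of synonyms and onomatopoeia.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension; real Word2Vec vectors are far larger

# Toy "word embedding" lookup standing in for a trained Word2Vec model.
vocab = {w: rng.normal(size=DIM) for w in
         ["dog", "puppy", "woof", "rain", "drizzle", "patter"]}

def class_embedding(terms):
    """Average the embeddings of a label plus its linguistic enrichments
    (synonyms, semantic broadening, onomatopoeia)."""
    return np.mean([vocab[t] for t in terms], axis=0)

# Hypothetical enriched label descriptions per class.
classes = {
    "dog": ["dog", "puppy", "woof"],
    "rain": ["rain", "drizzle", "patter"],
}
prototypes = {c: class_embedding(ts) for c, ts in classes.items()}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(audio_embedding):
    """Assign the unseen-class label whose prototype is most similar."""
    return max(prototypes, key=lambda c: cosine(audio_embedding, prototypes[c]))

# A fake "audio embedding" placed near the dog prototype for illustration;
# in the paper this vector would come from the modified YAMNet network.
query = prototypes["dog"] + 0.05 * rng.normal(size=DIM)
print(predict(query))  # expected to recover "dog" for this near-prototype query
```

The key design point is that richer textual input to the word-embedding side moves class prototypes toward the acoustic semantics of the sound (e.g. "woof" for a dog bark), which is what the linguistic-device augmentation targets.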