
Semantically-informed Deep Neural Networks for sound recognition

Michele Esposito (Maastricht University); Giancarlo Valente (Maastricht University); Yenisel Plasencia-Calaña (Maastricht University); Michel Dumontier (Maastricht University); Bruno L. Giordano (CNRS); Elia Formisano (Maastricht University)

07 Jun 2023

Deep neural networks (DNNs) for sound recognition learn to categorize a barking sound as a "dog" and a meowing sound as a "cat", but they do not exploit the semantic relations between classes (e.g., both are animal vocalisations). Cognitive neuroscience research, however, suggests that human listeners automatically exploit higher-level semantic information about the sources in addition to acoustic information. Inspired by this notion, we introduce here a DNN that learns to recognize sounds and simultaneously learns the semantic relations between the sources (semDNN). Comparison of semDNN with a homologous network trained with categorical labels (catDNN) revealed that semDNN produces semantically more accurate labelling than catDNN in sound recognition tasks and that semDNN embeddings preserve higher-level semantic relations between sound sources. Importantly, through a model-based analysis of human dissimilarity ratings of natural sounds, we show that semDNN approximates the behaviour of human listeners better than catDNN and several other DNN and NLP comparison models.

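To make the contrast between the two training regimes concrete, the sketch below shows one plausible way to set them up: a shared audio encoder trained either with categorical cross-entropy over class indices (catDNN-style) or by regressing the audio representation onto pretrained word embeddings of the class labels with a cosine loss (semDNN-style). This is not the authors' implementation; the architecture, loss choices, and all names (AudioBackbone, CatDNN, SemDNN, word_vecs) are illustrative assumptions.

```python
# Minimal sketch (PyTorch), assuming log-mel spectrogram inputs and
# pretrained NLP word vectors for the class labels. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioBackbone(nn.Module):
    """Toy CNN encoder standing in for the shared audio feature extractor."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return self.net(x)

class CatDNN(nn.Module):
    """Categorical variant: plain softmax classification over sound classes."""
    def __init__(self, n_classes, embed_dim=128):
        super().__init__()
        self.backbone = AudioBackbone(embed_dim)
        self.head = nn.Linear(embed_dim, n_classes)

    def loss(self, x, class_idx):
        return F.cross_entropy(self.head(self.backbone(x)), class_idx)

class SemDNN(nn.Module):
    """Semantic variant: project audio features into the space of the label
    word embeddings, so semantically related classes remain close."""
    def __init__(self, label_vectors, embed_dim=128):
        super().__init__()
        self.backbone = AudioBackbone(embed_dim)
        self.proj = nn.Linear(embed_dim, label_vectors.shape[1])
        self.register_buffer("label_vectors", F.normalize(label_vectors, dim=1))

    def loss(self, x, class_idx):
        pred = F.normalize(self.proj(self.backbone(x)), dim=1)
        target = self.label_vectors[class_idx]
        return (1.0 - F.cosine_similarity(pred, target, dim=1)).mean()

# Usage with dummy data: 4 spectrograms, 10 classes, 300-d label embeddings.
specs = torch.randn(4, 1, 64, 100)
labels = torch.randint(0, 10, (4,))
word_vecs = torch.randn(10, 300)   # placeholder for real pretrained embeddings
print(CatDNN(10).loss(specs, labels).item())
print(SemDNN(word_vecs).loss(specs, labels).item())
```

The intuition behind the semantic target is that, under such a loss, a misclassification of "dog" as "wolf" is penalised less than "dog" as "engine", because the word embeddings of related classes lie closer together; a one-hot cross-entropy target treats both errors identically.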