TEMPORAL CONTRASTIVE-LOSS FOR AUDIO EVENT DETECTION
Sandeep Kothinti, Mounya Elhilali
SPS
Temporal coherence is a feature-binding mechanism that ensures features which evolve together in time are attributed to the same object or event. Coherence has been extensively studied in biological systems, demonstrating how the brain leverages this mechanism to perform complex tasks in real environments and to segregate complex sensory signals (wholes) into individual objects (parts), following Gestalt principles. Though intuitive and computationally tractable, these concepts have rarely been leveraged in audio technologies. Audio event detection is an application that specifically deals with identifying sound events in an audio recording; hence it is a natural avenue to explore principles of temporal coherence. In this study, we propose coherence-based learning, formulated as a contrastive loss, to train event detection models whereby embeddings driven by acoustic events are coherently constrained to maximize discriminability across events. This approach improves detection performance with no additional computational cost at inference and only a small overhead during training.
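The abstract does not spell out the loss itself, but a coherence-driven contrastive objective of the kind described can be sketched as a pairwise margin loss over frame embeddings: frames belonging to the same event are pulled together (temporally coherent positives), while frames from different events are pushed at least a margin apart. The function below is a minimal illustrative sketch, not the authors' implementation; the names `temporal_contrastive_loss`, `event_labels`, and the choice of a squared-distance margin loss are assumptions for exposition.

```python
import numpy as np

def temporal_contrastive_loss(embeddings, event_labels, margin=1.0):
    """Hypothetical sketch of a temporal-coherence contrastive loss.

    embeddings:   (T, D) array of per-frame embeddings.
    event_labels: length-T sequence of event identities per frame.
    Same-event frame pairs are treated as coherent positives (distance
    is minimized); different-event pairs must be at least `margin` apart.
    """
    T, _ = embeddings.shape
    loss, pairs = 0.0, 0
    for i in range(T):
        for j in range(i + 1, T):
            d = np.linalg.norm(embeddings[i] - embeddings[j])
            if event_labels[i] == event_labels[j]:
                loss += d ** 2                      # coherent pair: pull together
            else:
                loss += max(0.0, margin - d) ** 2   # incoherent pair: push apart
            pairs += 1
    return loss / max(pairs, 1)
```

In practice such a term would be added to the standard detection loss during training, which is consistent with the claim of a small training-time overhead and no extra cost at inference, since the loss is dropped once the model is trained.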