A FRAME LOSS OF MULTIPLE INSTANCE LEARNING FOR WEAKLY SUPERVISED SOUND EVENT DETECTION
Xu Wang, Xiangjinzi Zhang, Shengwu Xiong, Yunfei Zi
Sound event detection (SED) consists of two subtasks: predicting the classes of sound events within an audio clip (audio tagging) and indicating the onset and offset times of each event (localization). One common approach to SED with weak labels is the multiple instance learning (MIL) method. However, the general MIL method only optimizes a global loss computed from the aggregated clip-wise predictions and the weak clip labels, with no direct constraint on the frame-wise predictions, which leads to a large number of unreasonable prediction values. To address this issue, we explore the deterministic information that can be used to constrain the frame-wise predictions, and based on it we design a frame loss with two terms. Experimental results on the DCASE2017 Task 4 dataset demonstrate that the proposed loss improves the performance of the general MIL method. While this article focuses on SED applications, the proposed methods could be applied widely to MIL problems.
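To make the weak-label setting concrete, the sketch below illustrates the general MIL objective the abstract refers to: frame-wise event probabilities are pooled into a clip-wise prediction, which is then compared to the weak clip label with binary cross-entropy. The max-pooling choice and the function name `mil_clip_loss` are illustrative assumptions, not the paper's specific formulation, and the proposed two-term frame loss is not reproduced here since its exact form is not given in the abstract.

```python
import numpy as np

def mil_clip_loss(frame_probs, clip_labels, eps=1e-7):
    """Global MIL loss for weakly labeled SED (illustrative sketch).

    frame_probs: (T, C) array of per-frame event probabilities in [0, 1]
    clip_labels: (C,) weak clip-level labels in {0, 1}

    Frame-wise probabilities are aggregated into one clip-wise prediction
    per class (max pooling here, one common choice), then scored with
    binary cross-entropy against the weak label. Note that the frame-wise
    predictions themselves are never directly constrained, which is the
    gap the paper's frame loss is designed to close.
    """
    clip_probs = frame_probs.max(axis=0)            # (C,) aggregated prediction
    clip_probs = np.clip(clip_probs, eps, 1 - eps)  # numerical stability
    bce = -(clip_labels * np.log(clip_probs)
            + (1 - clip_labels) * np.log(1 - clip_probs))
    return bce.mean()
```

Because only the pooled clip prediction enters the loss, a single spurious high-probability frame is enough to satisfy a positive label (or to be penalized under a negative one), while the remaining frames receive no gradient signal at all.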