SEMI-SUPERVISED SOUND EVENT DETECTION WITH PRE-TRAINED MODEL
Liang Xu (Beijing Institute of Technology); Lizhong Wang (Samsung); Sijun Bi (Beijing Institute of Technology); Hanyue Liu (Beijing Institute of Technology); Jing Wang (Beijing Institute of Technology)
Sound event detection (SED) is an interesting but challenging task due to the scarcity of labeled data and the diversity of sound events in real life. In this paper, we focus on the semi-supervised SED task and utilize a pre-trained model from another field to help improve detection performance. Pre-trained models have been widely used for various tasks in the speech and audio field, such as automatic speech recognition and audio tagging. We use the pre-trained model PANNs, which is well suited to the SED task, and propose two methods to fuse the features from PANNs with those of the original model. In addition, we propose a weight-raised temporal contrastive loss to enhance the temporal characteristics of the fused features. Experimental results show that the models using pre-trained features outperform the baseline by 8.5% and 9.1% on the public evaluation dataset in terms of the polyphonic sound detection score (PSDS).
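The abstract does not specify how the PANNs features are combined with the original model's features. A minimal sketch of one plausible fusion strategy, frame-wise concatenation followed by a linear projection, is shown below; all dimensions and the random projection are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical dimensions (not from the paper): number of frames,
# SED backbone feature width, and PANNs embedding width.
T, D_SED, D_PANN = 156, 128, 2048

rng = np.random.default_rng(0)
sed_feat = rng.standard_normal((T, D_SED))    # frame features from the SED model
pann_feat = rng.standard_normal((T, D_PANN))  # frame features from PANNs

# Concatenate the two feature streams per frame, then project back to the
# SED feature width with a linear map (random here; learned in practice).
W = rng.standard_normal((D_SED + D_PANN, D_SED)) * 0.01
fused = np.concatenate([sed_feat, pann_feat], axis=1) @ W

print(fused.shape)
```

In a real system the projection would be a trainable layer inside the network, and the fused features would feed the frame-level classifier that produces per-event activity probabilities.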