Weakly Labelled Audio Tagging Via Convolutional Networks With Spatial And Channel-Wise Attention
Sixin Hong, Wenwu Wang, Yuexian Zou, Meng Cao
-
SPS
IEEE Members: $11.00
Non-members: $15.00Length: 14:19
Multiple instance learning (MIL) with convolutional neural networks (CNNs) has been proposed recently for weakly labelled audio tagging. However, features from the various filtering channels and spatial regions are often treated equally, which may limit its performance in event prediction. In this paper, we propose a novel attention mechanism, namely, spatial and channel-wise attention (SCA). For spatial attention, we divide it into global and local submodules with the former to capture the event-related spatial regions and the latter to estimate the onset and offset of the event. Considering the variations in CNN channels, channel-wise attention is also exploited to recognize different sound scenes. The proposed SCA can be employed into any CNNs seamlessly with affordable overheads and is end-to-end trainable fashion. Extensive experiments on weakly labelled dataset Audioset show that the proposed SCA with CNNs achieves a state-of-the-art mean average precision (mAP) of 0.390.