Weakly Labelled Audio Tagging Via Convolutional Networks With Spatial And Channel-Wise Attention

Sixin Hong, Wenwu Wang, Yuexian Zou, Meng Cao

SPS Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 14:19

04 May 2020

Multiple instance learning (MIL) with convolutional neural networks (CNNs) has recently been proposed for weakly labelled audio tagging. However, features from different filtering channels and spatial regions are often treated equally, which may limit performance in event prediction. In this paper, we propose a novel attention mechanism, namely spatial and channel-wise attention (SCA). The spatial attention is divided into global and local submodules: the former captures the event-related spatial regions, and the latter estimates the onset and offset of the event. To account for the variations across CNN channels, channel-wise attention is also exploited to recognize different sound scenes. The proposed SCA can be integrated seamlessly into any CNN with affordable overhead and is trainable end-to-end. Extensive experiments on the weakly labelled dataset AudioSet show that the proposed SCA with CNNs achieves a state-of-the-art mean average precision (mAP) of 0.390.
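The abstract does not give the exact formulation of SCA, so the following is only a minimal, parameter-free sketch of the general idea of weighting a CNN feature map by channel and by time-frequency position before pooling. It is not the authors' method: the function names (`channel_attention`, `spatial_attention`, `attend_and_pool`), the sigmoid gate, and the softmax over positions are all assumptions made for illustration, and a real model would use learned layers. The local submodule for onset and offset estimation is not covered.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    # Subtract the maximum for numerical stability.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_attention(fmap):
    """Return one gate in (0, 1) per channel of fmap, shaped [C][T][F].

    Assumption: the gate is a sigmoid of the channel's global average.
    """
    gates = []
    for channel in fmap:
        flat = [v for row in channel for v in row]
        gates.append(sigmoid(sum(flat) / len(flat)))
    return gates


def spatial_attention(fmap):
    """Return [T][F] weights that sum to 1 over all time-frequency positions.

    Assumption: the score of a position is its mean activation across channels.
    """
    n_channels = len(fmap)
    n_frames = len(fmap[0])
    n_bins = len(fmap[0][0])
    scores = [
        sum(fmap[c][t][f] for c in range(n_channels)) / n_channels
        for t in range(n_frames)
        for f in range(n_bins)
    ]
    weights = softmax(scores)
    return [weights[t * n_bins:(t + 1) * n_bins] for t in range(n_frames)]


def attend_and_pool(fmap):
    """Pool each channel with the spatial weights, then scale by its gate."""
    gates = channel_attention(fmap)
    weights = spatial_attention(fmap)
    pooled = []
    for c, channel in enumerate(fmap):
        weighted_sum = sum(
            weights[t][f] * channel[t][f]
            for t in range(len(channel))
            for f in range(len(channel[0]))
        )
        pooled.append(gates[c] * weighted_sum)
    return pooled


# Toy feature map: 2 channels, 2 time frames, 2 frequency bins.
toy_fmap = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],
]
clip_features = attend_and_pool(toy_fmap)
```

In this toy setup, `clip_features` holds one pooled value per channel, which a classifier could map to clip-level tag probabilities in a weakly labelled setting.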

Tags:

sps conference

icassp 2020 virtual conference

May 2020

icassp 2020