CNN-TRANSFORMER WITH SELF-ATTENTION NETWORK FOR SOUND EVENT DETECTION

Keigo Wakayama, Shoichiro Saito

SPS

Length: 00:09:02

12 May 2022

In sound event detection (SED), the representational capacity of deep neural network (DNN) models must be increased to significantly improve accuracy or to increase the number of classifiable classes. When building large-scale DNN models, a highly parameter-efficient architecture is preferable. In image recognition, it has been proposed to replace the convolutional neural network (CNN) that extracts high-level features with a highly parameter-efficient DNN architecture, namely a self-attention network (SAN). High-level features are the essential information that contributes to prediction. In our SED experiments, we found that simply replacing the CNN with a SAN makes it difficult to build a model that exceeds the prediction accuracy of the CNN-Transformer. To construct a model with high prediction accuracy that also captures the properties of acoustic signals well, we propose an architecture called the CNN-SAN-Transformer, which retains the CNN in the blocks close to the input and uses the SAN in all remaining blocks. Experimental results suggest that the proposed method achieves the same or higher prediction accuracy than the CNN-Transformer with fewer parameters, and higher prediction accuracy with a similar number of parameters, and thus that it may be a parameter-efficient architecture.
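The block arrangement described in the abstract can be illustrated with a minimal Python sketch. It only models the ordering of block types (CNN blocks nearest the input, SAN blocks after them, then the Transformer) and does not implement any layers. The names `BlockSpec`, `plan_feature_extractor`, and `describe`, as well as the block counts used in the example, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BlockSpec:
    """One feature-extraction block; `kind` is either "cnn" or "san"."""
    index: int
    kind: str


def plan_feature_extractor(num_blocks: int, num_cnn_blocks: int) -> List[BlockSpec]:
    """Keep CNN in the blocks closest to the input and use SAN in the rest.

    The block counts are illustrative assumptions, not values from the paper.
    """
    if not 0 <= num_cnn_blocks <= num_blocks:
        raise ValueError("num_cnn_blocks must be between 0 and num_blocks")
    return [
        BlockSpec(index=i, kind="cnn" if i < num_cnn_blocks else "san")
        for i in range(num_blocks)
    ]


def describe(num_blocks: int, num_cnn_blocks: int) -> str:
    """Render the data flow from input to the Transformer stage as text."""
    kinds = [b.kind for b in plan_feature_extractor(num_blocks, num_cnn_blocks)]
    return " -> ".join(["input"] + kinds + ["transformer"])


if __name__ == "__main__":
    # Hypothetical 4-block extractor with CNN kept in the first 2 blocks.
    print(describe(4, 2))
```

In this sketch, setting `num_cnn_blocks` equal to `num_blocks` corresponds to the CNN-Transformer baseline, and setting it to 0 corresponds to replacing every CNN block with SAN, the variant the abstract reports as hard to make competitive.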

Tags:

sound event detection

dnn architecture

vector attention

weakly-supervised sed

self-attention network