CNN-TRANSFORMER WITH SELF-ATTENTION NETWORK FOR SOUND EVENT DETECTION
Keigo Wakayama, Shoichiro Saito
In sound event detection (SED), the representation ability of deep neural network (DNN) models must be increased to significantly improve accuracy or to handle more classes. When building such large-scale DNN models, a parameter-efficient architecture is preferable. In image recognition, it has been proposed to replace the convolutional neural network (CNN) that extracts high-level features, i.e., the information most essential to prediction, with a more parameter-efficient architecture: a self-attention network (SAN). In SED, however, our experiments show that it is difficult to exceed the prediction accuracy of the CNN-Transformer simply by replacing the CNN with a SAN. To build a model with high prediction accuracy that still captures the properties of acoustic signals well, we propose the CNN-SAN-Transformer, an architecture that retains CNN in the blocks close to the input and uses SAN in all remaining blocks. Experimental results suggest that the proposed method matches or exceeds the prediction accuracy of the CNN-Transformer with fewer parameters, achieves higher accuracy with a comparable number of parameters, and may therefore be a parameter-efficient architecture.
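A minimal sketch of how such a block arrangement might look, assuming a PyTorch implementation. The block counts, layer sizes, the SANBlock design (self-attention over frequency bins), and all names below are illustrative assumptions, not the authors' exact model:

```python
import torch
import torch.nn as nn


class SANBlock(nn.Module):
    """Illustrative SAN block: self-attention across the frequency bins
    of each time frame, with a residual connection and layer norm."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                     # x: (batch, channels, time, freq)
        b, c, t, f = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * t, f, c)  # attend along freq
        out, _ = self.attn(seq, seq, seq)
        seq = self.norm(seq + out)                        # residual + norm
        return seq.reshape(b, t, f, c).permute(0, 3, 1, 2)


class CNNSANTransformer(nn.Module):
    """CNN blocks near the input, SAN blocks for the remaining high-level
    feature extraction, then a Transformer encoder for temporal modeling."""
    def __init__(self, n_classes, channels=64, cnn_blocks=2, san_blocks=2):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(cnn_blocks):           # CNN retained close to the input
            layers += [nn.Conv2d(in_ch, channels, 3, padding=1),
                       nn.BatchNorm2d(channels), nn.ReLU(),
                       nn.AvgPool2d((1, 2))]  # pool along frequency only
            in_ch = channels
        self.cnn = nn.Sequential(*layers)
        self.san = nn.Sequential(*[SANBlock(channels) for _ in range(san_blocks)])
        enc = nn.TransformerEncoderLayer(d_model=channels, nhead=4,
                                         batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):                     # x: (batch, 1, time, freq) log-mel
        h = self.san(self.cnn(x))             # CNN low-level, SAN high-level
        h = h.mean(dim=3).transpose(1, 2)     # pool freq -> (batch, time, ch)
        h = self.transformer(h)               # temporal modeling
        return torch.sigmoid(self.head(h))    # frame-wise event probabilities


# Usage: 100 frames of 64-bin log-mel input, 10 event classes.
probs = CNNSANTransformer(n_classes=10)(torch.randn(2, 1, 100, 64))
```

In this sketch, pooling is applied only along the frequency axis so the time resolution needed for frame-level event detection is preserved through the feature extractor.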