Acoustic Scene Classification Using Deep Residual Networks With Late Fusion Of Separated High And Low Frequency Paths
Mark D. McDonnell, Wei Gao
SPS
We investigate the problem of acoustic scene classification using a deep residual network applied to log-mel spectrograms complemented by log-mel deltas and delta-deltas. We design the network to account for the fact that the temporal and frequency axes of a spectrogram represent fundamentally different information. In particular, the residual network contains two pathways, one for high frequencies and one for low frequencies, that are fused just two convolutional layers before the network output. We conduct experiments on two public 2019 DCASE datasets for acoustic scene classification: the first with binaural audio inputs recorded by a single device, and the second with single-channel audio inputs recorded by a variety of devices. We show that the performance of our models is significantly enhanced by the use of log-mel deltas, and that overall our approach can train strong single models, without any supplementary data, that generalize well to unknown devices. In particular, our approach achieved second place in 2019 DCASE Task 1B (0.4% behind the winning entry) and the best Task 1B evaluation results (by a large margin of over 5%) on test data from a device not used to record any training data.
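To make the input layout concrete, below is a minimal numpy sketch of the three-channel feature stack (log-mel, deltas, delta-deltas) and the frequency-axis split feeding the two pathways. The mel-band count, frame count, delta window width, and the midpoint split are illustrative assumptions, not values taken from the paper, and the random array stands in for a real log-mel spectrogram.

```python
import numpy as np

def deltas(feat, width=2):
    """First-order regression deltas along the time axis (last axis),
    using the standard HTK/librosa-style formula with edge padding."""
    denom = 2 * sum(n * n for n in range(1, width + 1))
    padded = np.pad(feat, ((0, 0), (width, width)), mode="edge")
    d = np.zeros_like(feat)
    for n in range(1, width + 1):
        d += n * (padded[:, width + n : padded.shape[1] - width + n]
                  - padded[:, width - n : padded.shape[1] - width - n])
    return d / denom

n_mels, n_frames = 128, 431          # hypothetical feature dimensions
logmel = np.random.randn(n_mels, n_frames)  # placeholder for a real log-mel spectrogram

# Stack log-mel, deltas, and delta-deltas as three input channels
x = np.stack([logmel, deltas(logmel), deltas(deltas(logmel))])  # shape (3, 128, 431)

# Split the frequency axis at the midpoint: one tensor per network pathway
low = x[:, : n_mels // 2, :]   # low-frequency path input,  shape (3, 64, 431)
high = x[:, n_mels // 2 :, :]  # high-frequency path input, shape (3, 64, 431)
```

Each half would then pass through its own residual pathway, with the two paths concatenated two convolutional layers before the output.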