TRANSFORMER-BASED BIOACOUSTIC SOUND EVENT DETECTION ON FEW-SHOT LEARNING TASKS
Liwen You (Amazon); Erika Pelaez Coyotl (Amazon); Suren Gunturu (Amazon); Maarten Van Segbroeck (Amazon)
Automatic detection of bioacoustic sound events is crucial for monitoring wildlife. Given the tedious annotation process, the limited number of labeled events, and the large volume of recordings, few-shot learning (FSL) is well suited to detecting such events from only a few examples. Typical FSL frameworks for sound event detection use Convolutional Neural Networks (CNNs) to extract features. However, CNNs fail to capture long-range relationships and global context in audio data. We present an approach that combines the Audio Spectrogram Transformer (AST), a data augmentation regime, and transductive inference to detect sound events on the DCASE 2022 (Task 5) dataset. Our results show that the AST model outperforms a CNN-based model on all recordings. With transductive inference on FSL tasks, our approach achieves a 6% improvement over the baseline AST feature-extraction pipeline. Our approach generalizes well across sound events from different animal species, recordings, and durations, suggesting its effectiveness for FSL tasks.
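The few-shot setup summarized above, classifying query sounds from a handful of labeled support examples, can be sketched as nearest-prototype classification over feature embeddings. This is a common FSL baseline, not a reproduction of the paper's pipeline: the AST feature extractor, augmentation regime, and transductive step are omitted, and all names below are illustrative.

```python
import numpy as np

def prototypes(support_emb: np.ndarray, support_labels: np.ndarray):
    """Compute one prototype (mean embedding) per class from the support set.

    support_emb: (n_support, dim) embeddings, e.g. from a pretrained encoder.
    support_labels: (n_support,) integer class labels.
    """
    classes = np.unique(support_labels)
    protos = np.stack(
        [support_emb[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, protos

def classify(query_emb: np.ndarray, classes: np.ndarray, protos: np.ndarray):
    """Assign each query embedding to the class of its nearest prototype
    (Euclidean distance)."""
    # dists[i, j] = distance from query i to prototype j
    dists = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy 2-D "embeddings": two support examples per class.
support = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
classes, protos = prototypes(support, labels)
preds = classify(np.array([[0.2, 0.3], [9.0, 10.0]]), classes, protos)
```

In a real pipeline the embeddings would come from the trained encoder, and a transductive method would additionally refine the prototypes using the unlabeled query set rather than classifying each query independently.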