Sumit Kumar (IIT Kanpur); B Anshuman (IIT Kanpur); Linus Ruettimann (University of Zurich and ETH Zurich); Richard Hahnloser (University of Zurich and ETH Zurich); Vipul Arora (IIT Kanpur)
Event detection improves when events are captured by two different modalities rather than just one. However, training detection systems
on multiple modalities is challenging, in particular when there is
an abundance of unlabeled data but only a limited amount of labeled data.
We develop a novel self-supervised learning technique for multi-
modal data that learns (hidden) correlations between simultaneously
recorded microphone (sound) signals and accelerometer (body vibration) signals. The key objective of this work is to learn useful
embeddings associated with high performance in downstream event
detection tasks when labeled data is scarce and the audio events of
interest, songbird vocalizations, are sparse. We base our approach on deep canonical correlation analysis (DCCA), which suffers
from event sparseness. We overcome the sparseness of positive labels by first learning a data sampling model from the labeled data
and then applying DCCA to the output it produces. This method, which
we term balanced DCCA (b-DCCA), improves the performance of
the unsupervised embeddings on the downstream supervised audio
detection task compared to classical DCCA. Because data labels
are frequently imbalanced, our method might be of broad utility in
low-resource scenarios.
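
The two ingredients described above, balanced sampling followed by a correlation objective, can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names (balanced_indices, total_canonical_correlation), the 0.5 threshold, and the regularization constant are assumptions made for illustration, and the correlation is computed with linear CCA on fixed features, whereas DCCA would maximize the same quantity over the outputs of two neural networks.

```python
import numpy as np


def balanced_indices(event_prob, n_per_class, threshold=0.5, seed=0):
    """Pick equal numbers of likely-event and likely-background frames.

    event_prob is assumed to come from a sampling model trained on the
    small labeled set (hypothetical; any per-frame score would do).
    """
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(event_prob >= threshold)
    neg = np.flatnonzero(event_prob < threshold)
    n = min(n_per_class, len(pos), len(neg))
    chosen_pos = rng.choice(pos, size=n, replace=False)
    chosen_neg = rng.choice(neg, size=n, replace=False)
    return np.concatenate([chosen_pos, chosen_neg])


def _inv_sqrt(mat):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T


def total_canonical_correlation(x, y, reg=1e-4):
    """Sum of canonical correlations between two views (rows = frames)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    # Regularized covariance estimates keep the inverse square roots stable.
    sxx = x.T @ x / (n - 1) + reg * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + reg * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)
    t = _inv_sqrt(sxx) @ sxy @ _inv_sqrt(syy)
    # The singular values of t are the canonical correlations.
    return np.linalg.svd(t, compute_uv=False).sum()


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy stand-ins for microphone and accelerometer features.
    mic = rng.normal(size=(1000, 4))
    acc = mic @ rng.normal(size=(4, 4)) + 0.1 * rng.normal(size=(1000, 4))
    # Sparse events: only a few frames score as likely events.
    prob = np.where(rng.random(1000) < 0.05, 0.9, 0.1)
    idx = balanced_indices(prob, n_per_class=20)
    print(total_canonical_correlation(mic[idx], acc[idx]))
```

In this sketch, the balanced index set determines which simultaneously recorded frame pairs enter the correlation objective, so that rare events are not swamped by background frames.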