LEARN A ROBUST REPRESENTATION FOR COVER SONG IDENTIFICATION VIA AGGREGATING LOCAL AND GLOBAL MUSIC TEMPORAL CONTEXT
Deshun Yang, Xiaoou Chen, Chaoya Jiang
Recently, deep learning models have been proposed for cover song identification, designed to learn fixed-length feature vectors for music recordings. However, the temporal progression of music, which is important for measuring the melodic similarity between two recordings, is not well exploited by these models. In this paper, we propose a new Siamese architecture that learns deep representations for cover song identification, in which Dilated Temporal Pyramid Convolution exploits the local temporal context and Temporal Self-Attention exploits the global temporal context of music recordings. In addition to the conventional block that computes the similarity between a pair of recordings, we add a classification block that assigns each recording to its clique. By combining the regression loss with the classification loss, our model learns more robust and discriminative latent representations. The representations extracted by our model are substantially superior to existing hand-crafted features and learned deep features, and experimental results show that our approach outperforms state-of-the-art methods on several public datasets by a large margin.
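As a rough illustration of the ideas summarized above, the minimal PyTorch sketch below shows one plausible form of the three components the abstract names: a dilated temporal pyramid convolution for local context, temporal self-attention for global context, and a combined regression-plus-classification objective over a Siamese pair. All class names, layer sizes, the single-head attention, and the cosine-similarity regression term are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedTemporalPyramid(nn.Module):
    """Parallel 1-D convolutions with increasing dilation rates,
    capturing local temporal context at multiple scales."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations  # padding=d keeps the time length unchanged
        ])

    def forward(self, x):                      # x: (batch, in_ch, time)
        # Concatenate the multi-scale branches along the channel axis.
        return torch.cat([F.relu(b(x)) for b in self.branches], dim=1)

class TemporalSelfAttention(nn.Module):
    """Single-head self-attention over the time axis, so every frame
    can attend to every other frame (global temporal context)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, x):                      # x: (batch, time, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v

class CoverSongEncoder(nn.Module):
    """Maps a frame-level feature sequence (e.g. CQT frames) to a
    fixed-length embedding plus clique logits; hyperparameters here
    are placeholders."""
    def __init__(self, feat_dim=84, hidden=128, embed_dim=256, n_cliques=1000):
        super().__init__()
        self.pyramid = DilatedTemporalPyramid(feat_dim, hidden)
        self.attn = TemporalSelfAttention(hidden * 4)  # 4 pyramid branches
        self.proj = nn.Linear(hidden * 4, embed_dim)
        self.classifier = nn.Linear(embed_dim, n_cliques)

    def forward(self, x):                      # x: (batch, feat_dim, time)
        h = self.pyramid(x)                    # local temporal context
        h = self.attn(h.transpose(1, 2))       # global context, (B, T, C)
        emb = self.proj(h.mean(dim=1))         # pool time -> fixed length
        return emb, self.classifier(emb)

def siamese_loss(emb_a, emb_b, logits_a, logits_b, pair_label, clique_a, clique_b):
    """Combined objective: a regression term on pair similarity plus
    cross-entropy terms assigning each recording to its clique."""
    sim = F.cosine_similarity(emb_a, emb_b)            # (batch,)
    reg = F.mse_loss(sim, pair_label.float())          # 1 = covers, 0 = not
    cls = F.cross_entropy(logits_a, clique_a) + F.cross_entropy(logits_b, clique_b)
    return reg + cls
```

In this sketch, the same encoder (shared weights) would be applied to both recordings of a pair, matching the Siamese setup; at retrieval time only the fixed-length embeddings would be compared, so variable-length recordings can be matched efficiently.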