Learning Music Sequence Representation from Text Supervision

Tianyu Chen, Shuai Zhang, Haoyi Zhou, Jianxin Li, Yuan Xie, Shaohan Huang

Length: 00:10:29

13 May 2022

Music representation learning is notoriously difficult because complex, human-related concepts are embedded in a sequence of numerical signals. To extract a better MUsic SEquence Representation from labeled audio, we propose a novel text-supervision pre-training method, namely MUSER. MUSER adopts an audio-spectrum-text tri-modal contrastive learning framework, in which the text input can be any form of metadata, converted to text with the help of text templates, and the spectrum is derived from the audio sequence. Our experiments show that MUSER adapts more flexibly to downstream tasks than the current data-hungry pre-training method, and that it requires only 0.056% of the pre-training data to achieve state-of-the-art performance.
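To make the idea of tri-modal contrastive learning with text templates concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the template wording, the function names (fill_template, symmetric_info_nce, tri_modal_loss), the temperature of 0.07, and the choice to average a symmetric InfoNCE loss over all three modality pairs are assumptions made for illustration. The embeddings are assumed to come from three separate encoders that are not shown.

```python
import numpy as np


def fill_template(metadata: dict) -> str:
    """Turn a metadata record into one sentence (hypothetical template)."""
    parts = [f"{key} is {value}" for key, value in sorted(metadata.items())]
    return "A piece of music whose " + " and ".join(parts) + "."


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(a: np.ndarray, b: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss: row i of `a` should match row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(a))                 # matching pairs sit on the diagonal
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return float(0.5 * (loss_ab + loss_ba))


def tri_modal_loss(audio, spectrum, text, temperature: float = 0.07) -> float:
    """Average the pairwise losses over the three modality pairs (assumption)."""
    return (
        symmetric_info_nce(audio, text, temperature)
        + symmetric_info_nce(spectrum, text, temperature)
        + symmetric_info_nce(audio, spectrum, temperature)
    ) / 3.0


if __name__ == "__main__":
    print(fill_template({"genre": "rock", "artist": "Queen"}))
    rng = np.random.default_rng(0)
    # Stand-ins for encoder outputs: batch of 8 items, 16-dimensional embeddings.
    audio, spectrum, text = (rng.normal(size=(8, 16)) for _ in range(3))
    print(tri_modal_loss(audio, spectrum, text))
```

In this sketch the loss drops when the three embeddings of the same item agree and rises when they are unrelated, which is the behaviour a contrastive objective relies on. Any metadata field can be passed through the template, which is one way the text input could take "any form" as the abstract describes.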

Tags:

music representation

contrastive learning

cross-modal learning

deep learning

Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
