Disentangling the Horowitz factor: Learning content and style from expressive piano performance
Huan Zhang (Queen Mary University of London); Simon Dixon (Queen Mary University of London)
In the Western art music tradition, expressive piano performance consists of two kinds of information: the score, with pitch and timing expressed in simple musical units along with occasional expression instructions, and the performer's interpretation of the score, involving variations in tempo, dynamics and articulation. In this paper, we present a novel framework for learning representations that disentangle musical content and performance style from expressive piano performances in an unsupervised manner. Our method is based on an extension of the vector-quantized variational autoencoder (VQ-VAE) with individual content and style branches, along with mutual information (MI) minimization techniques and self-supervised strategies. We performed experiments and ablation studies on the ATEPP dataset, a large set of automatically transcribed virtuosic piano performances with rich stylistic variations, and evaluated content reconstruction and style discrimination in a style-transfer setting. Our experiments demonstrate that the model learnt separate latent variables encoding musical content (such as pitch and relative timing) and stylistic attributes: generated samples align well with the content input, with low note error rates (NER), and the 40-way style discrimination proxy task outperformed the baseline with a top-1 accuracy of 0.168.
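As a rough illustration of the VQ-VAE bottleneck the abstract describes, the sketch below quantizes separate content and style latents against their own codebooks. All names and sizes here are hypothetical assumptions for illustration (the style codebook size of 40 merely mirrors the 40-way performer proxy task); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters, not taken from the paper.
CONTENT_CODES = 64   # codebook size for the content branch
STYLE_CODES = 40     # echoes the 40-way style discrimination proxy task
DIM = 8              # latent dimensionality

content_codebook = rng.normal(size=(CONTENT_CODES, DIM))
style_codebook = rng.normal(size=(STYLE_CODES, DIM))

def quantize(z, codebook):
    """Vector quantization: snap latent z to its nearest codebook entry.

    Each branch of a two-branch VQ-VAE applies this independently,
    so content and style are forced through separate discrete codes.
    """
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

# A dummy encoder output split into content and style halves.
z_content = rng.normal(size=DIM)
z_style = rng.normal(size=DIM)

c_idx, c_quantized = quantize(z_content, content_codebook)
s_idx, s_quantized = quantize(z_style, style_codebook)
```

Style transfer then amounts to decoding one performance's content code with another performance's style code; the MI-minimization terms mentioned in the abstract would discourage the two latents from carrying redundant information.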