Self-Supervised Contrastive Learning for Singing Voices

Hiromu Yakura (University of Tsukuba); Kento Watanabe (National Institute of Advanced Industrial Science and Technology (AIST)); Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST))

09 Jun 2023

This study introduces self-supervised contrastive learning to acquire feature representations of singing voices. To acquire robust representations without labels, conventional self-supervised contrastive learning trains neural networks so that the feature representation of a sample is close to those of its computationally transformed versions. We likewise employ two transformations, pitch shifting and time stretching, chosen with the nature of singing voices in mind. However, we use them in the opposite way: we train the networks to push the representations of the transformed versions away from that of the original sample. The networks thus learn to discriminate the changes in vocal timbre introduced by pitch shifting without time stretching and the changes in singing expression introduced by time stretching without pitch shifting. Consequently, the acquired representations become attentive to vocal timbre and singing expression. We confirmed this through a singer identification task, in which we trained a classifier to learn the relationship between the feature representations and the corresponding singer labels of 500 singers. The employed transformations improved the classifier's top-1 accuracy by 9.12 percentage points, from 53.96% when the feature representations were acquired without the transformations to 63.08% with them. Furthermore, the proposed approach can be extended to acquire feature representations attentive to either vocal timbre or singing expression, but not the other, by changing how the transformations are incorporated. In particular, we explored the characteristics of such vocal timbre-oriented and singing expression-oriented feature representations with respect to song genre, singer gender, and vocal technique, and confirmed that they successfully capture different aspects of singing voices.
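The idea of treating transformed versions as negatives can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss function, so an InfoNCE-style loss is assumed here, and the choice of positive (another segment assumed to come from the same singing voice), the temperature, and all names such as `contrastive_loss` are hypothetical. The 2-D vectors stand in for network outputs; no audio processing is performed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumption, not taken from the abstract).

    The loss is small when the anchor is similar to the positive and
    dissimilar to every negative. Here the negatives are the
    representations of the pitch-shifted and time-stretched versions
    of the anchor, so minimizing the loss pushes them away.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


# Toy representations (hypothetical values).
anchor = [1.0, 0.0]      # original singing-voice segment
positive = [0.9, 0.1]    # assumed positive: another segment of the same voice

# Case 1: the network maps transformed versions close to the original.
pitch_shifted_near = [0.8, 0.2]
time_stretched_near = [0.85, 0.15]
loss_near = contrastive_loss(anchor, positive,
                             [pitch_shifted_near, time_stretched_near])

# Case 2: the network maps transformed versions far from the original.
pitch_shifted_far = [0.0, 1.0]
time_stretched_far = [-0.2, 1.0]
loss_far = contrastive_loss(anchor, positive,
                            [pitch_shifted_far, time_stretched_far])

print(f"loss when transformed versions are near: {loss_near:.4f}")
print(f"loss when transformed versions are far:  {loss_far:.4f}")
```

Under these assumptions the loss is lower in the second case, so gradient descent would favor representations that separate a voice from its pitch-shifted and time-stretched versions, which is the behavior the abstract describes.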

Tags:

Speech production, perception and psychoacoustics

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
