PERCEPTUAL-SIMILARITY-AWARE DEEP SPEAKER REPRESENTATION LEARNING FOR MULTI-SPEAKER GENERATIVE MODELING

Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari


Length: 00:05:20

08 May 2022

We propose novel algorithms for incorporating perceptual similarity among speakers into deep speaker representation learning. The proposed algorithms use a perceptual speaker similarity matrix, obtained from large-scale perceptual scoring, as the target for speaker encoder training. They learn speaker embeddings from three different representations of the matrix: a set of vectors, the Gram matrix, and a graph. To reduce the costs of scoring and training, we further propose an active learning algorithm that iterates perceptual similarity scoring and speaker encoder training. The algorithm selects the speaker pairs to be scored next based on the similarity predictions of the sequentially trained speaker encoder. The evaluation results demonstrate that 1) our representation learning algorithms learn speaker embeddings that are strongly correlated with perceptual similarity scores, 2) the embeddings improve synthetic speech quality in speech autoencoding tasks more than conventional d-vectors obtained by discriminative modeling, 3) our active learning algorithm achieves higher synthetic speech quality while reducing the costs of scoring and training, and 4) among the proposed similarity {vector, matrix, graph} embedding algorithms, the first (vector) achieves the best speaker similarity for synthetic speech, and the third (graph) gives the largest improvement in synthetic speech naturalness.
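The abstract does not spell out the training objectives, so the following is only a rough sketch of what matching a Gram matrix of speaker embeddings to a perceptual similarity matrix could look like. It assumes inner-product similarity between embeddings, a squared-error penalty, and exclusion of the diagonal (self-similarity) entries; the function names and the toy scores are hypothetical and are not taken from the authors' work.

```python
import numpy as np

def gram_matrix(embeddings):
    """Inner-product Gram matrix of row-wise speaker embeddings."""
    return embeddings @ embeddings.T

def similarity_matrix_loss(embeddings, target_sim):
    """Mean squared gap between off-diagonal Gram entries and target scores.

    Assumption: diagonal (self-similarity) entries are ignored.
    """
    n = embeddings.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    diff = gram_matrix(embeddings) - target_sim
    return float(np.mean(diff[off_diag] ** 2))

# Toy data: three hypothetical speakers with 2-dimensional unit-norm embeddings.
emb = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 0.0]])

# Hypothetical perceptual scores scaled to [0, 1]; speakers 0 and 2 sound alike.
scores = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0]])

loss_match = similarity_matrix_loss(emb, scores)
# A target claiming every pair is maximally similar disagrees with emb.
loss_mismatch = similarity_matrix_loss(emb, np.ones((3, 3)))
```

In this toy case the embeddings reproduce the first target exactly, so the loss is zero, while the all-ones target yields a positive loss. In a real system the embeddings would come from a trainable speaker encoder and the loss would be minimized by gradient descent.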



Value-Added Bundle(s) Including this Product

ICASSP 2022, May 2022 Virtual and In-Person Conference - Presentation Videos Product Bundle
