Improving Speaker-Attribute Estimation By Voting Based On Speaker Cluster Information
Naohiro Tawara, Satoshi Kobashikawa, Hosana Kamiyama, Atsunori Ogawa
SPS
Length: 14:59
This paper proposes a general post-processing method for improving speaker-attribute estimation. Estimating speaker-specific attributes such as age and gender is an important task with a wide range of applications. While the recently proposed deep neural network-based end-to-end approach achieves high performance, the model tends to over-fit to specific speakers when the amount of training data is limited or imbalanced. To address this over-fitting problem, we propose a general framework for correcting unreliable results. The proposed algorithm first groups the target utterances into speaker clusters according to speaker similarity based on i-vectors. Then, for each speaker cluster, the speaker-attribute class of the cluster is determined by voting over the utterances assigned to it. By replacing the result of each utterance with its cluster's speaker-attribute class, we can correct the results of unreliable utterances. We evaluated the proposed algorithm on two tasks: age estimation using the NIST-SRE10 and age-gender classification using an in-house read speech corpus. It yielded significant improvements in mean absolute error and classification error.
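The voting step described above can be illustrated with a minimal sketch. It assumes the speaker-cluster assignments have already been obtained (for example, by clustering i-vectors) and covers only the classification case. The function name `vote_by_cluster`, the toy labels, and the tie-breaking behavior are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict


def vote_by_cluster(cluster_ids, predictions):
    """Replace each utterance's predicted class with its cluster's majority class.

    cluster_ids: one speaker-cluster label per utterance (assumed precomputed).
    predictions: one estimated attribute class per utterance.
    """
    # Count the predicted classes within each speaker cluster.
    votes = defaultdict(Counter)
    for cid, pred in zip(cluster_ids, predictions):
        votes[cid][pred] += 1

    # Pick the most frequent class per cluster. How ties should be broken is
    # not specified in the abstract; here it is left to Counter.most_common.
    majority = {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

    # Overwrite every utterance-level result with its cluster-level class.
    return [majority[cid] for cid in cluster_ids]


# Toy example: two speaker clusters, each containing one outlier prediction.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
predictions = ["female", "female", "male", "male", "male", "female", "male"]
corrected = vote_by_cluster(cluster_ids, predictions)
print(corrected)
```

In this toy input, the single "male" result in cluster 0 and the single "female" result in cluster 1 are overwritten by their cluster's majority class. For a continuous target such as age, a different cluster-level aggregate would be needed; the abstract does not say which one is used.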