Bridging Mixture Density Networks With Meta-Learning For Automatic Speaker Identification
Ruirui Li, Wei Wang, Chu-Cheng Hsieh, Jyun-Yu Jiang, Hongda Mao, Xian Wu
SPS
Length: 12:20
Speaker identification answers the fundamental question "Who is speaking?" and enables downstream applications to provide a personalized experience. Both the prevalent i-vector-based solutions and deep learning solutions usually treat all users equally during training. However, many new users start with limited labeled training data, which often results in poor accuracy in recognizing their voices. To alleviate this training-data deficiency, we propose MDNML, a Mixture Density Network-based Meta-Learning method for speaker identification. MDNML focuses on quickly learning to recognize new users who each have only a few seconds of labeled data. We conduct experiments on the LibriSpeech dataset and compare MDNML with four state-of-the-art baseline methods. The results show that MDNML achieves higher accuracy than all baseline methods in recognizing new users with limited labeled utterances. The proposed solution significantly expedites learning by transferring knowledge learned from the existing user base through gradient-based meta-learning. We consider this work a stepping stone toward more sophisticated meta-learning frameworks for accelerating voice recognition. Finally, we discuss a strategy for enhancing accuracy by incorporating household-based acoustic profiles into MDNML.
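The abstract does not describe the MDNML architecture or its loss, so the following is only a generic illustration of what a mixture density network typically optimizes: the negative log-likelihood of an observation under a Gaussian mixture whose parameters a network would output. The function name `mixture_nll`, the diagonal-covariance assumption, and the parameter layout are all hypothetical choices made for this sketch and are not taken from the paper.

```python
import numpy as np

def mixture_nll(x, logits, means, log_stds):
    """Negative log-likelihood of x under a diagonal Gaussian mixture.

    Hypothetical layout (an assumption, not from the paper):
      x:        (D,)   observed feature vector
      logits:   (K,)   unnormalized mixing weights
      means:    (K, D) component means
      log_stds: (K, D) log standard deviations per component
    """
    # Normalize the mixing weights in log space (log-softmax).
    log_pi = logits - np.logaddexp.reduce(logits)
    # Log-density of x under each diagonal Gaussian component.
    z = (x - means) / np.exp(log_stds)
    log_comp = -0.5 * np.sum(z ** 2 + 2.0 * log_stds + np.log(2.0 * np.pi), axis=1)
    # Combine components with a numerically stable log-sum-exp.
    return -np.logaddexp.reduce(log_pi + log_comp)

# Example call with two components in two dimensions.
x = np.array([0.1, -0.2])
logits = np.zeros(2)
means = np.array([[0.0, 0.0], [1.0, 1.0]])
log_stds = np.zeros((2, 2))
loss = mixture_nll(x, logits, means, log_stds)
```

In a gradient-based meta-learning setup, a loss of this kind would be minimized for a few steps on each new user's small labeled set, starting from parameters learned on the existing user base; how MDNML actually does this is not stated in the abstract.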