Frame-Level Phoneme-Invariant Speaker Embedding For Text-Independent Speaker Recognition On Extremely Short Utterances
Naohiro Tawara, Atsunori Ogawa, Tomoharu Iwata, Tetsuji Ogawa, Marc Delcroix
SPS
Length: 14:08
This paper investigates a phoneme-invariant speaker embedding approach for speaker recognition on extremely short utterances. Intuitively, phonemes are nuisance information for the text-independent speaker recognition task, since the content of the speech usually differs between enrollment and test time. However, many studies have shown that incorporating phoneme information is quite effective at improving the performance of speaker recognition systems. One reasonable explanation for this counter-intuitive result is that the pooling mechanism of segment-based speaker embedding can focus on the specific phonemes that carry rich speaker information, and phoneme information may help it do so. From this insight, we hypothesize that the pooling mechanism and phoneme-aware training are harmful when extracting speaker embeddings from extremely short utterances, which contain too few phonemes for pooling to select from. To verify this hypothesis, an adversarial framework is introduced to remove phoneme variability from the frame-wise speaker embeddings. The experimental results on the LibriSpeech corpus confirm that our frame-wise, phoneme-adversarial approach outperforms the conventional segment-wise, phoneme-aware approach for short utterances of less than about 1.4 seconds.
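The abstract does not give the exact form of the adversarial objective, so the following is only a minimal sketch of one common formulation, not the authors' implementation. It assumes the frame-level encoder is trained to minimize the speaker cross-entropy minus a weighted phoneme cross-entropy, while a separate phoneme classifier minimizes the phoneme cross-entropy on its own. The function names, the weight `lam`, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy of a single frame given its class logits and target index."""
    return -math.log(softmax(logits)[target])


def encoder_objective(spk_logits, spk_targets, phn_logits, phn_targets, lam=0.5):
    """Assumed adversarial objective for the encoder, averaged over frames.

    The encoder is rewarded for low speaker loss and for HIGH phoneme loss,
    i.e. for embeddings from which phonemes are hard to predict.
    Returns (encoder_loss, speaker_loss, phoneme_loss).
    """
    n = len(spk_targets)
    spk_loss = sum(cross_entropy(l, t) for l, t in zip(spk_logits, spk_targets)) / n
    phn_loss = sum(cross_entropy(l, t) for l, t in zip(phn_logits, phn_targets)) / n
    return spk_loss - lam * phn_loss, spk_loss, phn_loss


# Toy data (hypothetical): two frames, 2 speakers, 4 phoneme classes.
spk_logits = [[2.0, 0.0], [1.5, -0.5]]
spk_targets = [0, 0]
phn_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]  # phonemes unpredictable
phn_targets = [1, 3]

enc_loss, spk_loss, phn_loss = encoder_objective(
    spk_logits, spk_targets, phn_logits, phn_targets
)
print(f"encoder={enc_loss:.3f} speaker={spk_loss:.3f} phoneme={phn_loss:.3f}")
```

In this sketch, a phoneme classifier that can only guess uniformly gives the largest phoneme loss and therefore the lowest encoder objective, which is the sense in which the frame-wise embeddings are pushed toward phoneme invariance. In practice such objectives are often optimized with alternating updates or a gradient reversal layer; which of these the paper uses is not stated in the abstract.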