Noise-Robust Spoken Language Identification Using Language Relevance Factor Based Embedding
Muralikrishna H, Shikha Gupta, Dileep Aroor Dinesh, Padmanabhan Rajan
SPS
Length: 0:14:57
State-of-the-art systems for spoken language identification (LID) represent an utterance using an i-vector or an embedding extracted with a deep neural network (DNN). These fixed-length representations are obtained without explicitly considering how relevant each frame-level feature vector is to deciding the class label. In this paper, we propose a new utterance representation that takes the relevance of the individual frame-level features into account. The proposed representation also preserves, to some extent, the locally available LID-specific information in the input features. To make better use of this local-level information, we propose a novel support vector machine (SVM) classifier based on a segment-level matching kernel. Representing the utterance according to the relevance of its frame-level features improves the robustness of the LID system to different background noise conditions. Experiments on speech with different background conditions show that the proposed approach outperforms state-of-the-art approaches on noisy speech and performs comparably to them on clean speech.
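The abstract does not give the formulation of the relevance factor, so the following is only a minimal sketch of the general idea of relevance-weighted pooling, not the authors' method. It assumes that some per-frame relevance score is already available, and the function name `relevance_weighted_embedding`, the softmax normalization, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def relevance_weighted_embedding(frames, scores):
    """Pool frame-level features into one fixed-length vector.

    frames: array of shape (T, D), one D-dimensional feature per frame.
    scores: array of shape (T,), a hypothetical per-frame relevance score.
    Illustrative assumption: scores are turned into weights with a softmax.
    """
    frames = np.asarray(frames, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if frames.ndim != 2 or scores.shape != (frames.shape[0],):
        raise ValueError("expected frames of shape (T, D) and scores of shape (T,)")

    # Numerically stable softmax: weights are positive and sum to one.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Weighted average over the time axis gives a D-dimensional vector.
    return weights @ frames


# Toy usage: three 2-dimensional frames, the first judged most relevant.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
embedding = relevance_weighted_embedding(frames, scores=[2.0, 0.0, 0.0])
```

With equal scores this reduces to plain mean pooling, which ignores frame relevance. Unequal scores let frames judged more informative (for example, less affected by noise) dominate the utterance-level vector.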