Orthogonal Training For Text-Independent Speaker Verification
Yingke Zhu, Brian Mak
In this paper, we propose orthogonal training schemes to improve the effectiveness of cosine similarity measurements in text-independent speaker verification (SV) tasks. Compared with the PLDA backend, cosine similarity is simple to compute and requires no extra data or time to build a separate model. A cosine similarity backend is also highly desirable for building end-to-end SV systems. However, cosine similarity implicitly assumes that the dimensions of the speaker embeddings are orthogonal, an assumption that is usually not satisfied in current SV systems. The first training scheme applies singular value decomposition (SVD) to the weight matrix of the speaker embedding extraction layer in our time delay neural network (TDNN)-based SV system, and replaces the original weight matrix with the matrix constructed from the left unitary matrix and the singular value matrix. The reconstructed matrix in the extraction layer is then held constant while the remaining network is fine-tuned with an orthogonality regularizer. We further investigate orthogonal training from scratch, in which orthogonality regularization is incorporated throughout network training. Experimental results show that our orthogonal training methods significantly improve system performance with a cosine similarity backend.
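The scheme described above can be captured in a short PyTorch sketch under stated assumptions: the helper names below (split_embedding_layer, orthogonality_penalty) are illustrative rather than the authors' code, the returned V^T is assumed to be folded into the preceding affine layer, and the regularizer shown is the common soft-orthogonality penalty ||W W^T - I||_F^2 rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn


def split_embedding_layer(layer: nn.Linear):
    """Sketch of the SVD-based re-parameterization of the embedding layer.

    Decomposes the weight W = U @ diag(S) @ Vh, builds a new frozen layer
    whose weight is U @ diag(S), and returns Vh so it can be absorbed into
    the preceding layer (assumption for illustration).
    """
    W = layer.weight.detach()                       # shape: (emb_dim, in_dim)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    k = S.numel()                                   # = min(emb_dim, in_dim)

    frozen = nn.Linear(k, W.size(0), bias=layer.bias is not None)
    with torch.no_grad():
        frozen.weight.copy_(U * S)                  # U @ diag(S)
        if layer.bias is not None:
            frozen.bias.copy_(layer.bias)
    frozen.weight.requires_grad = False             # hold reconstructed matrix constant
    return frozen, Vh


def orthogonality_penalty(W: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality regularizer ||W W^T - I||_F^2 added to the loss."""
    gram = W @ W.t()
    eye = torch.eye(gram.size(0), device=W.device, dtype=W.dtype)
    return ((gram - eye) ** 2).sum()
```

In this sketch, the penalty would be added (with some weight) to the SV training loss for the weight matrices of the layers that remain trainable during fine-tuning; in the from-scratch variant, the same regularization term would be applied from the beginning of training.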