STATISTICAL PYRAMID DENSE TIME DELAY NEURAL NETWORK FOR SPEAKER VERIFICATION
Zi-Kai Wan, Qing-Hua Ren, You-Cai Qin, Qi-Rong Mao
SPS
Length: 00:09:34
Recently, speaker verification (SV) techniques have come to rely on deep learning frameworks to extract more informative embedding vectors, which greatly improves accuracy compared with traditional machine learning methods. The well-known x-vector architecture, a time delay neural network (TDNN), is widely adopted for SV tasks. However, most existing variants rarely combine global and sub-region context information, and they suffer from the local receptive field imposed by the standard convolutional operation. In this paper, we propose the statistical pyramid dense TDNN (SPD-TDNN), which uses a statistical pyramid pooling module to capture context information. Specifically, the developed module adaptively exchanges information among contextual regions from different perspectives, which correspond to multiple parallel branches. The statistics collected by the global-region branch comprise the mean and standard deviation across the time domain, which capture more global context information. Extensive experiments on the VoxCeleb1 and VoxCeleb2 datasets demonstrate that the proposed SPD-TDNN outperforms the corresponding D-TDNN, D-TDNN-SS, and ECAPA-TDNN baselines, which achieve state-of-the-art performance on the SV task, at similar model complexity.
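As a rough illustration of pooling statistics over a global region and several sub-regions in parallel, the sketch below computes the mean and standard deviation of each channel over the whole time axis and over contiguous sub-regions at a few pyramid levels. It is not the authors' module. The abstract only states that the global-region branch uses the mean and standard deviation, so the function names, the bin counts `(1, 2, 4)`, the equal-width split, and the use of the same statistics in the sub-region branches are all assumptions made for illustration. The adaptive information exchange among regions is not modeled here.

```python
import math


def mean_std(frames):
    """Return (mean, population standard deviation) of a sequence of floats."""
    n = len(frames)
    mu = sum(frames) / n
    var = sum((x - mu) ** 2 for x in frames) / n
    return mu, math.sqrt(var)


def split_regions(num_frames, num_bins):
    """Split range(num_frames) into num_bins contiguous, near-equal (start, end) slices."""
    if num_bins < 1 or num_bins > num_frames:
        raise ValueError("num_bins must be between 1 and num_frames")
    bounds = [round(i * num_frames / num_bins) for i in range(num_bins + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_bins)]


def statistical_pyramid_pool(features, bins=(1, 2, 4)):
    """Pool each channel over every region of every pyramid level.

    features: list of channels, each a list of T floats (frame-level features).
    bins:     hypothetical pyramid levels; 1 is the global region, larger
              values split the time axis into that many sub-regions.
    Returns one list per channel holding [mean, std] for each region, in
    level order, so each channel yields 2 * sum(bins) values.
    """
    pooled = []
    for channel in features:
        stats = []
        for num_bins in bins:
            for start, end in split_regions(len(channel), num_bins):
                stats.extend(mean_std(channel[start:end]))
        pooled.append(stats)
    return pooled


# Usage: one channel with eight frames.
example = statistical_pyramid_pool([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]])
```

With bins `(1, 2, 4)` each channel is summarized by 14 values: two from the global region, four from the two halves, and eight from the four quarters. A fixed-length summary of this kind is what lets a variable-length utterance feed a fixed-size embedding layer.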