PCF: ECAPA-TDNN with Progressive Channel Fusion for Speaker Verification
Zhenduo Zhao (Institute of Acoustics, Chinese Academy of Sciences); Zhuo Li (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences); Wenchao Wang (Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China); Pengyuan Zhang (Institute of Acoustics, Chinese Academy of Sciences)
SPS
ECAPA-TDNN is currently the most popular TDNN-series model for speaker verification and set a new state of the art (SOTA) among TDNN models. However, its one-dimensional convolutions have a global receptive field over the feature channels, which destroys the time-frequency relevance of the spectrogram. In addition, ECAPA-TDNN has only five layers; this structure, much shallower than ResNet, restricts its capability to generate deep representations. To further improve ECAPA-TDNN, we first propose a progressive channel fusion (PCF) strategy that splits the spectrogram across the feature dimension and gradually expands the receptive field through the network. Second, we enlarge the model by extending its depth and branching its blocks. Our proposed model achieves an EER of 0.718 and a minDCF(0.01) of 0.0858 on Vox1-O, relative improvements of 16.1% and 19.5% over ECAPA-TDNN-large.
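The splitting idea in the abstract can be illustrated with a grouped one-dimensional convolution whose group count shrinks from layer to layer. The following is a minimal pure-Python sketch, not the authors' implementation: the 80 feature bins, the group schedule 8, 4, 2, 1, and the helper names `band_width_schedule` and `grouped_conv1d` are all assumptions made for illustration. It assumes each group count divides the previous one, so that group boundaries are nested and neighbouring bands merge cleanly.

```python
def band_width_schedule(num_bins, groups_per_layer):
    """Return how many input feature bins one output unit can see at each layer.

    Assumes nested group boundaries (each group count divides the previous one),
    so a layer with g groups covers num_bins // g bins per group.
    """
    widths = []
    for groups in groups_per_layer:
        if num_bins % groups != 0:
            raise ValueError("num_bins must be divisible by every group count")
        widths.append(num_bins // groups)
    return widths


def grouped_conv1d(x, weight, groups):
    """Naive grouped 1D convolution (cross-correlation, stride 1, no padding, no bias).

    x:      list of C_in channels, each a list of T samples
    weight: list of C_out filters, each shaped [C_in // groups][K]
    """
    c_in, c_out = len(x), len(weight)
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    kernel = len(weight[0][0])
    t_out = len(x[0]) - kernel + 1
    out = []
    for o in range(c_out):
        g = o // out_per_group  # the only input band this filter may read
        row = []
        for t in range(t_out):
            acc = 0.0
            for i in range(in_per_group):
                channel = x[g * in_per_group + i]
                for j in range(kernel):
                    acc += weight[o][i][j] * channel[t + j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical schedule: 80 feature bins, groups shrinking 8 -> 4 -> 2 -> 1.
print(band_width_schedule(80, [8, 4, 2, 1]))  # [10, 20, 40, 80]

# Four channels, two groups: outputs 0-1 see only channels 0-1, outputs 2-3 only channels 2-3.
x = [[1.0] * 3, [2.0] * 3, [10.0] * 3, [20.0] * 3]
w = [[[1.0], [1.0]] for _ in range(4)]
print(grouped_conv1d(x, w, groups=2))
```

With `groups=1` the same function reduces to an ordinary one-dimensional convolution in which every filter reads every feature channel, which is the global receptive field the abstract describes for ECAPA-TDNN. Deep learning frameworks typically expose the same control directly, for example the `groups` argument of PyTorch's `torch.nn.Conv1d`.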