Improving Deep CNN Networks with Long Temporal Context for Text-Independent Speaker Verification
Yong Zhao, Tianyan Zhou, Zhuo Chen, Jian Wu
SPS
Deep CNN networks have shown great success in various text-independent speaker recognition tasks. In this paper, we explore two approaches to modeling long temporal context to improve the performance of ResNet networks. The first integrates utterance-level mean and variance normalization into the ResNet architecture. The second combines a BLSTM and a ResNet into one unified architecture: the BLSTM layers model long-range, presumably phonetically aware, context, which helps the ResNet learn optimal attention weights and suppress environmental variations. The BLSTM outputs are projected into multi-channel feature maps and fed into the ResNet. Experiments on VoxCeleb1 and the internal MS-SV task show that, with attentive pooling, the proposed approaches achieve up to 23-28% relative improvement in EER over a well-trained ResNet baseline.
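Two of the components named in the abstract — utterance-level mean and variance normalization and attentive (statistics) pooling — can be illustrated numerically. The sketch below is not the authors' implementation; the function names, the `1e-8` stabilizers, and the specific attention form (a single-head scorer `v^T tanh(W h_t + b)` followed by a softmax over frames, as is common in attentive statistics pooling) are assumptions for illustration only.

```python
import numpy as np

def utterance_cmvn(feats):
    """Utterance-level mean and variance normalization.

    feats: (T, D) array of frame-level features for one utterance.
    Each feature dimension is normalized to zero mean and unit
    variance over the whole utterance (small epsilon for stability).
    """
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + 1e-8)

def attentive_stats_pooling(frames, W, b, v):
    """Attention-weighted statistics pooling (assumed single-head form).

    frames: (T, D) frame-level representations.
    W: (H, D), b: (H,), v: (H,) learnable attention parameters.
    Returns a (2*D,) utterance-level vector: the attention-weighted
    mean concatenated with the attention-weighted standard deviation.
    """
    scores = np.tanh(frames @ W.T + b) @ v          # (T,) per-frame scores
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                          # softmax over time
    mu = (alphas[:, None] * frames).sum(axis=0)     # weighted mean
    var = (alphas[:, None] * frames**2).sum(axis=0) - mu**2
    return np.concatenate([mu, np.sqrt(np.maximum(var, 1e-8))])
```

In the paper's pipeline these would sit at opposite ends of the network: normalization is applied to the input features before (or inside) the ResNet, while pooling collapses the frame axis of the final feature maps into a fixed-length speaker embedding.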