Length: 12:06
04 May 2020

Deep convolutional neural networks (CNNs) have shown great success in various text-independent speaker recognition tasks. In this paper, we explore two approaches to modeling long temporal contexts to improve the performance of ResNet networks. The first approach simply integrates utterance-level mean and variance normalization into the ResNet architecture. The second combines a BLSTM and a ResNet into one unified architecture. The BLSTM layers model long-range, supposedly phonetically aware, context information, which can help the ResNet learn optimal attention weights and suppress environmental variation. The BLSTM outputs are projected into multi-channel feature maps and fed into the ResNet network. Experiments on VoxCeleb1 and the internal MS-SV task show that, with attentive pooling, the proposed approaches achieve up to a 23-28% relative improvement in EER over a well-trained ResNet.
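As a rough illustration of the two ideas described above (not the authors' code), the sketch below shows utterance-level mean and variance normalization and the projection of BLSTM-style frame outputs into multi-channel feature maps for a 2-D ResNet. All dimensions (40-dim features, 512-wide BLSTM outputs, 4 channels) are assumptions for the example, and the BLSTM itself is stubbed with random frame outputs:

```python
import numpy as np

def utterance_mvn(feats, eps=1e-5):
    # Normalize each feature dimension using statistics computed over the
    # whole utterance (a long temporal context), not a short sliding window.
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True)
    return (feats - mean) / (std + eps)

def blstm_outputs_to_feature_maps(h, n_channels, feat_dim):
    # h: (time, 2*hidden) frame-level BLSTM outputs. A linear projection
    # (random here, learned in practice) maps each frame to
    # n_channels*feat_dim values, which are reshaped into a
    # (n_channels, feat_dim, time) multi-channel "image" for a 2-D ResNet.
    time, width = h.shape
    w = np.random.randn(width, n_channels * feat_dim) / np.sqrt(width)
    maps = h @ w                                 # (time, n_channels*feat_dim)
    return maps.reshape(time, n_channels, feat_dim).transpose(1, 2, 0)

feats = np.random.randn(300, 40) * 3.0 + 1.0    # 300 frames, 40-dim features
normed = utterance_mvn(feats)                   # per-dimension mean ~0, std ~1
maps = blstm_outputs_to_feature_maps(np.random.randn(300, 512),
                                     n_channels=4, feat_dim=40)
print(maps.shape)                               # (4, 40, 300)
```

In the paper's unified architecture the projection weights are trained jointly with the ResNet; here the random matrix only demonstrates the shape transformation.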
