
Multi-Resolution Multi-Head Attention In Deep Speaker Embedding

Zhiming Wang, Shuo Fang, Kaisheng Yao, Xiaolong Li

04 May 2020

Pooling is an essential component for capturing long-term speaker characteristics in speaker recognition. This paper proposes simple but effective pooling methods that compute attentive weights for better temporal aggregation over variable-length input speech, improving the end-to-end neural network's ability to discriminate among speakers. In particular, we observe that using multiple heads for attentive pooling over the entire encoded sequence, a method we term global multi-head attention, significantly improves performance compared to various pooling methods, including the recently proposed multi-head attention [1]. To improve the diversity of the attention heads, we further propose multi-resolution multi-head attention for pooling, which adds a temperature hyperparameter to each head. This yields a further performance gain on top of that achieved by using multiple heads. On the benchmark VoxCeleb1 dataset, the proposed method achieves a state-of-the-art Equal Error Rate (EER) of 3.966%. Our analysis shows that using multiple heads, and giving these heads multiple resolutions via different temperatures, leads to more certain attentive weights in the new state-of-the-art system.
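The sketch below illustrates the pooling idea described in the abstract: multiple attention heads scoring every frame of the encoded sequence (global multi-head attention), with a fixed temperature per head so that each head attends at a different resolution. It is a minimal illustration, not the paper's implementation; the layer sizes, the single linear scoring layer, the choice of temperatures, and the concatenation of per-head means are assumptions.

```python
# Minimal PyTorch sketch of multi-resolution multi-head attentive pooling.
# Assumed details (not from the paper): one linear scoring layer shared
# across frames, fixed per-head temperatures, and concatenation of the
# per-head weighted means into the final utterance-level embedding.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiResolutionMultiHeadPooling(nn.Module):
    def __init__(self, feat_dim, num_heads=4, temperatures=(0.5, 1.0, 2.0, 4.0)):
        super().__init__()
        assert len(temperatures) == num_heads
        self.num_heads = num_heads
        # One scoring vector per head, applied to every frame.
        self.score = nn.Linear(feat_dim, num_heads, bias=False)
        # Fixed per-head temperatures controlling how sharp or flat each
        # head's attention distribution is (its "resolution").
        self.register_buffer("temperatures", torch.tensor(temperatures))

    def forward(self, x):
        # x: (batch, time, feat_dim) frame-level encoder outputs
        scores = self.score(x)                             # (B, T, H)
        scores = scores / self.temperatures                # per-head temperature scaling
        weights = F.softmax(scores, dim=1)                 # attend over the time axis
        # Weighted mean per head, then concatenate heads into one embedding.
        pooled = torch.einsum("bth,btd->bhd", weights, x)  # (B, H, D)
        return pooled.reshape(x.size(0), -1)               # (B, H * D)


if __name__ == "__main__":
    frames = torch.randn(8, 300, 256)   # 8 utterances, 300 frames, 256-dim features
    pooling = MultiResolutionMultiHeadPooling(feat_dim=256)
    embedding = pooling(frames)
    print(embedding.shape)              # torch.Size([8, 1024])
```

With this layout, a low temperature sharpens a head's softmax so it concentrates on a few frames, while a high temperature flattens it toward an average over the utterance; using different temperatures across heads is one way to encourage the head diversity the abstract refers to.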
