Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning

Hui Chen (Tianjin University); Hanyi Zhang (Tianjin University); Longbiao Wang (Tianjin University); Kong Aik Lee (Institute for Infocomm Research, A*STAR); Meng Liu (Tianjin University); Jianwu Dang (Tianjin University)

07 Jun 2023

In self-supervised speaker verification, the quality of pseudo labels determines the upper bound of performance, and it is not uncommon to end up with a massive amount of unreliable pseudo labels. We observe that the complementary information across modalities provides a robust supervisory signal for audio and visual representation learning. This motivates us to propose an audio-visual self-supervised learning framework named Co-Meta Learning. Inspired by Co-teaching+, we design a strategy that allows the information of the two modalities to be coordinated through an "update by disagreement" rule. Moreover, we use the idea of model-agnostic meta-learning (MAML) to update the network parameters, which allows hard samples in one modality to be better resolved by the other modality through gradient regularization. Compared to the baseline, our proposed method achieves relative improvements of 29.8%, 11.7%, and 12.9% on the Vox-O, Vox-E, and Vox-H trials of the VoxCeleb1 evaluation dataset, respectively.
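The abstract does not include code, but the "update by disagreement" idea can be illustrated with a minimal, hypothetical PyTorch sketch of one co-training step in the spirit of Co-teaching+. All names (update_by_disagreement, keep_ratio, the small-loss selection rule) are assumptions for illustration; the authors' actual losses, sampling schedule, and MAML-style meta-update may differ.

```python
import torch
import torch.nn.functional as F

def update_by_disagreement(audio_net, visual_net, audio_x, visual_x,
                           pseudo_labels, keep_ratio, opt_a, opt_v):
    """Hypothetical sketch of one Co-teaching+-style step between modalities.

    Assumption: both networks classify into the same pseudo-label space,
    and each modality trusts the peer's small-loss samples as cleaner.
    """
    logits_a = audio_net(audio_x)
    logits_v = visual_net(visual_x)

    # Keep only samples where the two modalities disagree on the prediction.
    disagree = logits_a.argmax(dim=1) != logits_v.argmax(dim=1)
    if disagree.sum() == 0:
        return  # no disagreement in this batch; nothing to update

    la = F.cross_entropy(logits_a[disagree], pseudo_labels[disagree],
                         reduction="none")
    lv = F.cross_entropy(logits_v[disagree], pseudo_labels[disagree],
                         reduction="none")

    # Each network is updated on the samples its peer finds easiest
    # (small peer loss suggests a more reliable pseudo label).
    k = max(1, int(keep_ratio * disagree.sum().item()))
    idx_for_audio = lv.argsort()[:k]   # visual net selects for the audio net
    idx_for_visual = la.argsort()[:k]  # audio net selects for the visual net

    opt_a.zero_grad()
    la[idx_for_audio].mean().backward()
    opt_a.step()

    opt_v.zero_grad()
    lv[idx_for_visual].mean().backward()
    opt_v.step()
```

In the paper's full method, a MAML-style inner/outer loop would additionally regularize these updates so that each modality's gradient benefits the other's hard samples; that component is omitted from this sketch.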
