Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning

Hui Chen (Tianjin University); Hanyi Zhang (Tianjin University); Longbiao Wang (Tianjin University); Kong Aik Lee (Institute for Infocomm Research, A*STAR); Meng Liu (Tianjin University); Jianwu Dang (Tianjin University)

07 Jun 2023

In self-supervised speaker verification, the quality of pseudo labels determines the upper bound of performance, and it is not uncommon to end up with a massive amount of unreliable pseudo labels. We observe that the complementary information across modalities provides a robust supervisory signal for audio and visual representation learning. This motivates us to propose an audio-visual self-supervised learning framework named Co-Meta Learning. Inspired by Co-teaching+, we design a strategy that allows the information of the two modalities to be coordinated through an "update by disagreement" rule. Moreover, we use the idea of model-agnostic meta-learning (MAML) to update the network parameters, which allows hard samples in one modality to be better resolved by the other modality through gradient regularization. Compared to the baseline, our proposed method achieves relative improvements of 29.8%, 11.7%, and 12.9% on the Vox-O, Vox-E, and Vox-H trials of the VoxCeleb1 evaluation dataset, respectively.
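The abstract does not include code, but the "update by disagreement" idea can be illustrated with a minimal, hypothetical PyTorch sketch of one co-training step in the spirit of Co-teaching+. All names (update_by_disagreement, keep_ratio, the small-loss selection rule) are assumptions for illustration; the authors' actual losses, sampling schedule, and MAML-style meta-update may differ.

```python
import torch
import torch.nn.functional as F

def update_by_disagreement(audio_net, visual_net, audio_x, visual_x,
                           pseudo_labels, keep_ratio, opt_a, opt_v):
    """Hypothetical sketch of one Co-teaching+-style step between modalities.

    Assumption: both networks classify into the same pseudo-label space,
    and each modality trusts the peer's small-loss samples as cleaner.
    """
    logits_a = audio_net(audio_x)
    logits_v = visual_net(visual_x)

    # Keep only samples where the two modalities disagree on the prediction.
    disagree = logits_a.argmax(dim=1) != logits_v.argmax(dim=1)
    if disagree.sum() == 0:
        return  # no disagreement in this batch; nothing to update

    la = F.cross_entropy(logits_a[disagree], pseudo_labels[disagree],
                         reduction="none")
    lv = F.cross_entropy(logits_v[disagree], pseudo_labels[disagree],
                         reduction="none")

    # Each network is updated on the samples its peer finds easiest
    # (small peer loss suggests a more reliable pseudo label).
    k = max(1, int(keep_ratio * disagree.sum().item()))
    idx_for_audio = lv.argsort()[:k]   # visual net selects for the audio net
    idx_for_visual = la.argsort()[:k]  # audio net selects for the visual net

    opt_a.zero_grad()
    la[idx_for_audio].mean().backward()
    opt_a.step()

    opt_v.zero_grad()
    lv[idx_for_visual].mean().backward()
    opt_v.step()
```

In the paper's full method, a MAML-style inner/outer loop would additionally regularize these updates so that each modality's gradient benefits the other's hard samples; that component is omitted from this sketch.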
