
SELF-SUPERVISED AUDIO-VISUAL SPEECH REPRESENTATIONS LEARNING BY MULTIMODAL SELF-DISTILLATION

Jing-Xuan Zhang (University of Science and Technology of China); Genshun Wan (University of Science and Technology of China); Zhen-Hua Ling (University of Science and Technology of China); Jia Pan (iFlytek Research); Jianqing Gao (iFLYTEK); Cong Liu (iFLYTEK Research)

07 Jun 2023

In this work, we present a novel method, named AV2vec, for learning audio-visual speech representations by multimodal self-distillation. AV2vec consists of a student and a teacher module: the student performs a masked latent feature regression task on multimodal target features generated online by the teacher, and the teacher's parameters are a momentum update of the student's. Because the target features are generated online, AV2vec needs no iterative re-training as AV-HuBERT does, and its total training time is reduced to less than one-fifth. We further propose AV2vec-MLM in this study, which augments AV2vec with a masked language model (MLM)-style loss through multitask learning. Our experimental results show that AV2vec achieved performance comparable to the AV-HuBERT baseline. When combined with the MLM-style loss, AV2vec-MLM outperformed the baselines and achieved the best performance on the downstream tasks.
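The sketch below illustrates the two mechanisms named in the abstract: a teacher whose parameters are a momentum (exponential moving average) update of the student, and a masked latent feature regression loss computed against targets the teacher produces online. It is a minimal illustration only; the encoder architecture, feature dimensions, zero-based masking, and MSE regression loss are assumptions for readability, not the authors' implementation.

```python
# Minimal sketch of momentum-teacher self-distillation with masked latent
# feature regression. Module names, shapes, and the MSE loss are
# illustrative assumptions, not the AV2vec implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Stand-in for a fused audio-visual encoder (hypothetical)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)


student = Encoder()
teacher = copy.deepcopy(student)          # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)               # teacher receives no gradient updates


@torch.no_grad()
def momentum_update(student, teacher, m=0.999):
    # Teacher parameters are an exponential moving average of the student's.
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1.0 - m)


def training_step(features, mask_prob=0.5):
    # features: (batch, time, dim) fused audio-visual frames (assumed shape)
    with torch.no_grad():
        targets = teacher(features)       # multimodal targets generated online

    mask = torch.rand(features.shape[:2]) < mask_prob   # frames to mask
    masked = features.clone()
    masked[mask] = 0.0                    # simple zero masking for illustration

    preds = student(masked)
    # Masked latent feature regression: the student predicts the teacher's
    # features only at the masked positions.
    loss = F.mse_loss(preds[mask], targets[mask])
    loss.backward()
    momentum_update(student, teacher)
    return loss.item()


print(training_step(torch.randn(2, 50, 256)))
```

Because the teacher is refreshed continuously by the momentum update rather than recomputed from offline cluster assignments, no separate target-generation iteration is needed, which is the source of the training-time reduction claimed above.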
