MEMORY-AUGMENTED CONTRASTIVE LEARNING FOR TALKING HEAD GENERATION

Jianrong Wang (School of Computer Science and Technology, Tianjin University, Tianjin, China); Yaxin Zhao (Tianjin International Engineering Institute, Tianjin University, Tianjin, China); Hongkai Fan (School of Computer Science and Technology, Tianjin University, Tianjin, China); Tianyi Xu (Tianjin University); Qi Li (School of Electrical and Information Engineering, Tianjin University, Tianjin, China); Sen Li (School of Computer Science and Technology, Tianjin University, Tianjin, China); Li Liu (Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen)

08 Jun 2023

Given one reference facial image and a piece of speech as input, talking head generation aims to synthesize a realistic-looking talking head video. However, generating a lip-synchronized video with natural head movements is challenging. The same speech clip can correspond to multiple plausible lip and head movements; that is, there is no one-to-one mapping between speech and motion. To overcome this problem, we propose a Speech Feature Extractor (SPF) based on memory-augmented self-supervised contrastive learning, which introduces a memory module to store multiple different speech mapping results. In addition, we introduce a Mixture Density Network (MDN) into the landmark regression task to generate multiple predicted facial landmarks. Extensive qualitative and quantitative experiments show that the quality of our facial animation is significantly superior to that of state-of-the-art (SOTA) methods. The code has been released at https://github.com/Yaxinzhao97/MACL.git.
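
To illustrate why a mixture density output suits a one-to-many mapping, the following is a minimal sketch of a Gaussian mixture head for landmark regression. It is not the authors' implementation: the function names (`mdn_nll`, `mdn_sample`), the isotropic Gaussian components, and the hard-coded mixture parameters standing in for network outputs are all assumptions made for illustration.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log of the softmax mixture weights."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_norm for x in logits]


def mdn_nll(logits, means, sigmas, target):
    """Negative log-likelihood of `target` under an isotropic Gaussian mixture.

    logits: K unnormalized mixture weights (assumed network output)
    means:  K mean vectors, each of length D (flattened landmark coordinates)
    sigmas: K positive standard deviations, one per component
    target: ground-truth landmark vector of length D
    """
    dim = len(target)
    log_terms = []
    for log_w, mu, sigma in zip(log_softmax(logits), means, sigmas):
        sq_dist = sum((t - m) ** 2 for t, m in zip(target, mu))
        log_gauss = (-0.5 * sq_dist / (sigma * sigma)
                     - dim * math.log(sigma)
                     - 0.5 * dim * math.log(2.0 * math.pi))
        log_terms.append(log_w + log_gauss)
    # log-sum-exp over components, then negate to get a loss
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(v - top) for v in log_terms)))


def mdn_sample(logits, means, sigmas, rng):
    """Draw one landmark vector: pick a component, then sample around its mean."""
    weights = [math.exp(v) for v in log_softmax(logits)]
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, sigmas[k]) for m in means[k]]


if __name__ == "__main__":
    # Two hypothetical modes for a single 2-D landmark (e.g. mouth open vs. closed).
    logits = [0.0, 0.0]
    means = [[0.0, 0.0], [4.0, 4.0]]
    sigmas = [0.5, 0.5]
    rng = random.Random(0)
    print("NLL at a mode:    ", mdn_nll(logits, means, sigmas, [0.0, 0.0]))
    print("NLL between modes:", mdn_nll(logits, means, sigmas, [2.0, 2.0]))
    print("one sample:       ", mdn_sample(logits, means, sigmas, rng))
```

Under these assumptions, a target that lies on either mode receives a low loss, whereas a single-Gaussian regressor trained on the same data would be pulled toward the average of the two modes. Sampling repeatedly from the mixture yields different plausible landmark predictions for the same input.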

Tags:

Image and video representation