
Audio-driven Talking Head Video Generation with Diffusion Model

Yizhe Zhu (Shanghai Jiao Tong University); Chunhui Zhang (Shanghai Jiao Tong University, CloudWalk Technology Co., Ltd); Qiong Liu (CloudWalk Technology); Xi Zhou (CloudWalk Technology)

06 Jun 2023

Synthesizing high-fidelity talking head videos from input audio sequences is a highly anticipated technique in many applications, such as digital humans, virtual video conferencing, and human-computer interaction. Popular GAN-based methods aim to align speech audio with lip motions and head poses. However, existing methods are prone to training instability and even mode collapse, resulting in low-quality video generation. In this paper, we propose a novel audio-driven diffusion method for generating high-resolution, realistic talking head videos with the help of a denoising diffusion model. Specifically, a face attribute disentanglement module is proposed to disentangle eye-blinking and lip-motion features, where the lip-motion features are synchronized with audio features via a contrastive learning strategy, and the disentangled motion features align well with the talking head. Furthermore, the denoising diffusion model takes the source image and the warped motion features as input to generate high-resolution, realistic talking heads with diverse head poses. Extensive evaluations using multiple metrics demonstrate that our method outperforms current techniques both qualitatively and quantitatively.
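The audio-lip synchronization described in the abstract can be illustrated with an InfoNCE-style contrastive loss that pulls together paired audio and lip-motion embeddings and pushes apart mismatched pairs. The sketch below is a minimal illustration of this idea, not the paper's implementation; the encoder outputs, feature dimension, and temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F

def audio_lip_contrastive_loss(audio_emb, lip_emb, temperature=0.07):
    """InfoNCE-style contrastive loss: matching audio/lip pairs in a batch
    are treated as positives, all other pairings as negatives.

    audio_emb, lip_emb: (batch, dim) features from hypothetical audio and
    lip-motion encoders (stand-ins, not the paper's exact modules).
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    lip_emb = F.normalize(lip_emb, dim=-1)
    # (batch, batch) cosine-similarity matrix scaled by temperature.
    logits = audio_emb @ lip_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy over both matching directions (audio->lip, lip->audio).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random features standing in for encoder outputs.
audio_feat = torch.randn(8, 256)
lip_feat = torch.randn(8, 256)
loss = audio_lip_contrastive_loss(audio_feat, lip_feat)
```

Minimizing such a loss encourages lip-motion features for a frame to be closest to the audio features of the same frame, which is one common way to realize the synchronization objective the abstract refers to.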
