Learning Pose-Adaptive Lip Sync With Cascaded Temporal Convolutional Network

Ruobing Zheng, Bo Song, Changjiang Ji

SPS

Length: 00:08:20

10 Jun 2021

Speech-driven lip sync has become a promising technique for generating and editing talking-head videos. Existing studies mainly use 3D morphable models or 2D facial landmarks as intermediate face representations. However, progress on 2D-based methods has stalled recently because they cannot handle out-of-plane rotations, even though 2D landmarks can be extracted quickly and accurately. In this paper, we design a cascaded temporal convolutional network that successively generates mouth shapes and the corresponding jawlines from audio signals and template head poses. Instead of explicitly calibrating the rotation between the predicted mouth and the template face, we employ neural networks to learn the pose-adaptive mapping implicitly. We also propose an image-to-image translation-based neural rendering method for producing high-resolution, photo-realistic videos. Experiments show that our solution improves both mapping accuracy and visual performance over the baselines. This work could benefit many real-world applications such as virtual anchors, telepresence, and conversational agents.
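The cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature sizes (AUDIO_DIM, POSE_DIM, MOUTH_DIM, JAW_DIM), the two-layer depth, the use of causal dilated convolutions, and the random untrained weights are all assumptions made for illustration. The sketch only shows the data flow, in which a first temporal convolutional network maps audio and head pose to mouth landmarks, and a second maps the predicted mouth and head pose to the jawline.

```python
import random

# Hypothetical feature sizes per frame (assumptions, not taken from the paper).
AUDIO_DIM = 13   # e.g. one audio feature vector per video frame
POSE_DIM = 3     # e.g. yaw, pitch, roll of the template head pose
MOUTH_DIM = 40   # e.g. 20 mouth landmarks as (x, y) pairs
JAW_DIM = 34     # e.g. 17 jawline landmarks as (x, y) pairs


def causal_conv1d(seq, weights, bias, dilation=1):
    """Causal dilated 1D convolution over a list of per-frame feature lists.

    weights[k][o][i] is the weight of kernel tap k, output channel o,
    input channel i. Output frame t only reads input frames
    t, t - dilation, t - 2 * dilation, ... (missing frames count as zero).
    """
    out = []
    for t in range(len(seq)):
        frame = list(bias)
        for k, tap in enumerate(weights):
            src = t - k * dilation
            if src < 0:
                continue  # implicit zero padding on the left
            for o, row in enumerate(tap):
                frame[o] += sum(w * v for w, v in zip(row, seq[src]))
        out.append(frame)
    return out


def init_conv(rng, kernel_size, c_in, c_out):
    """Random (untrained) weights and zero biases for one conv layer."""
    weights = [[[rng.uniform(-0.1, 0.1) for _ in range(c_in)]
                for _ in range(c_out)]
               for _ in range(kernel_size)]
    return weights, [0.0] * c_out


def build_net(rng, c_in, hidden, c_out):
    """Two conv layers with dilations 1 and 2, as (weights, bias, dilation)."""
    return [
        (*init_conv(rng, 3, c_in, hidden), 1),
        (*init_conv(rng, 3, hidden, c_out), 2),
    ]


def tcn(seq, layers):
    """Apply the conv layers in order, with ReLU between (not after) them."""
    for idx, (weights, bias, dilation) in enumerate(layers):
        seq = causal_conv1d(seq, weights, bias, dilation)
        if idx < len(layers) - 1:
            seq = [[max(0.0, v) for v in frame] for frame in seq]
    return seq


def concat(a, b):
    """Concatenate two sequences frame by frame along the feature axis."""
    return [x + y for x, y in zip(a, b)]


def cascaded_lip_sync(audio, pose, mouth_net, jaw_net):
    """Stage 1: audio + pose -> mouth. Stage 2: mouth + pose -> jawline."""
    mouth = tcn(concat(audio, pose), mouth_net)
    jaw = tcn(concat(mouth, pose), jaw_net)
    return mouth, jaw


if __name__ == "__main__":
    rng = random.Random(0)
    mouth_net = build_net(rng, AUDIO_DIM + POSE_DIM, 8, MOUTH_DIM)
    jaw_net = build_net(rng, MOUTH_DIM + POSE_DIM, 8, JAW_DIM)
    frames = 10
    audio = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(frames)]
    pose = [[rng.random() for _ in range(POSE_DIM)] for _ in range(frames)]
    mouth, jaw = cascaded_lip_sync(audio, pose, mouth_net, jaw_net)
    print(len(mouth), len(mouth[0]), len(jaw), len(jaw[0]))
```

Feeding the head pose to both stages is one way to let the networks learn a pose-dependent mapping without an explicit rotation step; whether the paper conditions both stages in this way is an assumption of the sketch.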

Chairs:

Patrick Le Callet

Tags:

signal processing society

IEEE icassp 2021

virtual conference

2021

sps

virtual conference icassp 2021

june 6-11 2021

icassp 2021