Audio-driven facial landmark generation in violin performance using 3DCNN network with self attention model
Ting-Wei Lin (Academia Sinica); Chao-Lin Liu (National Chengchi University); Li Su (Academia Sinica)
In a music scenario, both auditory and visual elements are essential to an outstanding performance. Recent research has focused on generating body movements or fingering from the audio of music performances, but audio-driven face generation for music performance remains largely unaddressed. In this paper, we compile a violin soundtrack and facial expression dataset (VSFE) for modeling facial expressions in violin performance. To our knowledge, this is the first dataset mapping the relationship between violin performance audio and musicians' facial expressions. We then propose a 3DCNN network with self-attention and residual blocks for audio-driven facial expression generation. In the experiments, we compare our method with three baselines for talking-face generation.
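As a rough illustration only (not the authors' released implementation), the PyTorch sketch below combines a residual 3D-convolution block with temporal self-attention to regress facial landmark coordinates from an audio tensor; the layer sizes, the assumed (time, frequency, context) input layout, and the 68-landmark output dimension are all hypothetical choices made for the example.

```python
# Hypothetical sketch of a residual 3D-CNN + self-attention audio-to-landmark model.
import torch
import torch.nn as nn


class Residual3DBlock(nn.Module):
    """3D convolution block with a residual (skip) connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.conv(x))  # residual connection


class Audio2Landmarks(nn.Module):
    """Toy model: 3D-CNN encoder + temporal self-attention + landmark regressor."""
    def __init__(self, channels=16, n_landmarks=68):
        super().__init__()
        self.stem = nn.Conv3d(1, channels, kernel_size=3, padding=1)
        self.res_block = Residual3DBlock(channels)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4,
                                          batch_first=True)
        self.head = nn.Linear(channels, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, x):
        # x: (batch, 1, time, freq, context) -- an assumed 3D audio representation
        h = self.res_block(self.stem(x))        # (B, C, T, F, K)
        h = h.mean(dim=(3, 4)).transpose(1, 2)  # pool spectral dims -> (B, T, C)
        h, _ = self.attn(h, h, h)               # self-attention over time frames
        return self.head(h)                     # (B, T, n_landmarks * 2)


# Example: a 2-second clip encoded as a (1, 1, 50, 64, 5) tensor.
model = Audio2Landmarks()
landmarks = model(torch.randn(1, 1, 50, 64, 5))
print(landmarks.shape)  # torch.Size([1, 50, 136])
```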