LOCAL-GLOBAL CONTRAST FOR LEARNING VOICE-FACE REPRESENTATIONS

Guangyu Chen, Deyuan Zhang, Tao Liu, Xiaoyong Du

Lecture 09 Oct 2023

Leveraging deep learning to explore the associations between voices and faces has attracted extensive research interest. This research is usually formalized as cross-modal verification, matching, and retrieval tasks. Most works rely on a single local or global optimization objective for learning, ignoring that different testing scenarios may favor different optimization objectives. For example, local objectives are more helpful for verification and matching tasks, while global objectives contribute more to retrieval. In this study, we propose a learning framework based on both local and global objectives to improve the generalizability of the learned representations. First, we explore two ways of applying supervised contrastive loss (SCL) to learn voice-face representations. Second, we design a contrastive-form global optimization objective, which shows better performance and training efficiency. Experiments on the VoxCeleb dataset demonstrate the effectiveness of our framework.
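For readers unfamiliar with SCL, the following is a minimal sketch of the generic supervised contrastive loss applied to a mixed batch of voice and face embeddings that share identity labels. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the two SCL variants explored, and the function name `supervised_contrastive_loss`, the `temperature` value, and the choice to concatenate both modalities into one batch are all hypothetical.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (illustrative sketch).

    embeddings: (n, d) array; here voice and face embeddings stacked together.
    labels:     (n,) array of identity labels shared across both modalities.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # L2-normalise so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Exclude each sample's similarity with itself from the softmax.
    sim = np.where(self_mask, -np.inf, sim)
    # Numerically stable row-wise log-softmax.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    # Positives: other samples (either modality) with the same identity.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # skip anchors that have no positive in the batch
    sum_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(sum_pos[valid] / pos_count[valid]).mean())

# Toy batch (made-up values): two identities, one voice and one face embedding each.
voice = np.array([[1.0, 0.0], [0.0, 1.0]])
face = np.array([[0.9, 0.1], [0.1, 0.9]])
batch = np.concatenate([voice, face], axis=0)
identity = np.array([0, 1, 0, 1])
loss = supervised_contrastive_loss(batch, identity)
```

In this sketch the loss is small when each voice embedding lies close to the face embedding of the same identity and far from those of other identities. Each anchor is contrasted only against the samples in its own batch.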

Tags: voice-face association, supervised contrastive learning, cross-modal retrieval, multi-modal learning
