LOCAL-GLOBAL CONTRAST FOR LEARNING VOICE-FACE REPRESENTATIONS
Guangyu Chen, Deyuan Zhang, Tao Liu, Xiaoyong Du
Using deep learning to explore the associations between voices and faces has attracted extensive research interest. This research is usually formalized as cross-modal verification, matching, and retrieval tasks. Most existing works rely on a single local or global optimization objective for learning, ignoring that different testing scenarios may favor different objectives. For example, local objectives are more helpful for verification and matching, while global objectives contribute more to retrieval. In this study, we propose a learning framework that combines local and global objectives to improve the generalizability of the learned representations. First, we explore two ways of applying supervised contrastive loss (SCL) to learn voice-face representations. Second, we design a global optimization objective in contrastive form, which shows better performance and training efficiency. Experiments on the VoxCeleb dataset demonstrate the effectiveness of our framework.
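To make the supervised contrastive loss mentioned above concrete, the following is a minimal sketch of the generic SCL formulation in plain Python. It is not the authors' implementation: the abstract does not specify how SCL is applied, so the sketch assumes that voice and face embeddings live in one shared space, are pooled into a single batch, and are labeled by speaker identity. The function names, the temperature value, and the toy vectors are all illustrative assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one pooled batch.

    Assumption (not from the paper): `embeddings` mixes voice and face
    vectors in a shared space, and `labels` holds the identity of each.
    Samples with the same label are positives; all others are negatives.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        # Temperature-scaled cosine similarity between anchor i and every sample.
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n)]
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Numerically stable log-sum-exp over all samples except the anchor.
        max_sim = max(sims[j] for j in others)
        log_denom = max_sim + math.log(
            sum(math.exp(sims[j] - max_sim) for j in others))
        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


# Toy batch: rows 0 and 2 stand in for voice embeddings, rows 1 and 3 for
# face embeddings (hypothetical values chosen only for illustration).
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = supervised_contrastive_loss(batch, [0, 0, 1, 1])
mixed_loss = supervised_contrastive_loss(batch, [0, 1, 0, 1])
```

In this toy batch, labeling the nearby vectors as the same identity gives a lower loss than labeling distant vectors as the same identity, which is the behavior a local, pairwise objective of this kind is meant to encourage.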