Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity, and ultimately producing speaker identity representations that are more robust.
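To make the two-stream idea concrete, below is a minimal PyTorch sketch of one plausible reading of the architecture described above. The class name, layer sizes, and embedding dimensions are illustrative assumptions, not the authors' exact design: a shared trunk computes low-level audio features, and two separate heads then produce a content embedding and an identity embedding, each of which would be trained with its own cross-modal objective.

# Sketch only: an assumed two-stream audio network, not the paper's exact model.
import torch
import torch.nn as nn

class TwoStreamAudioNet(nn.Module):
    def __init__(self, content_dim=512, identity_dim=512):
        super().__init__()
        # (1) Shared low-level features, common to both representations.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # (2) Separate heads explicitly disentangle the two factors.
        self.content_head = nn.Linear(128, content_dim)    # linguistic content
        self.identity_head = nn.Linear(128, identity_dim)  # speaker identity

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, freq_bins, time_frames)
        shared = self.trunk(spectrogram).flatten(1)         # (batch, 128)
        return self.content_head(shared), self.identity_head(shared)

# Example forward pass on a dummy spectrogram batch. In the paper's setup,
# each embedding would be supervised by a different cross-modal signal
# (content via audio-visual synchrony, identity via the face in the video).
net = TwoStreamAudioNet()
content_emb, identity_emb = net(torch.randn(4, 1, 64, 100))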