VIDEO-MUSIC RETRIEVAL WITH FINE-GRAINED CROSS-MODAL ALIGNMENT

Yuki Era, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

SPS

Poster 10 Oct 2023

This paper presents a novel video-music retrieval method for videos containing humans. Our method constructs a cross-modal common embedding space from music features and video features based on human motion, and aligns this embedding space using beat and genre information. Beat and genre are distinctive properties of music, and music relates more closely to motion features than to features extracted from the whole video. Our method therefore performs cross-modal retrieval focused on the motion-music relation, and improves retrieval performance by using motion-based video features and embedding space alignment based on music properties. This is the first work to relate human motion to music for video-music retrieval. In the experiments, we compare our method with state-of-the-art video-music and human motion-music retrieval methods and verify that it achieves more than twice the Recall@1 of the conventional methods.
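The Recall@1 metric reported above can be illustrated with a minimal sketch. It assumes that video and music items have already been embedded into a common space, that the i-th video is paired with the i-th music track, and that cosine similarity is used for ranking; none of these details are stated in the abstract. The function names `cosine` and `recall_at_k` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(video_embs, music_embs, k=1):
    """Fraction of video queries whose paired music track is ranked in the top k.

    Assumption: video_embs[i] and music_embs[i] form the ground-truth pair.
    """
    hits = 0
    for i, video in enumerate(video_embs):
        sims = [cosine(video, music) for music in music_embs]
        # Indices of music tracks sorted from most to least similar.
        ranked = sorted(range(len(music_embs)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(video_embs)


# Toy embeddings in a 2-D common space (illustrative values only).
video_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
music_embs = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]

r1 = recall_at_k(video_embs, music_embs, k=1)
r3 = recall_at_k(video_embs, music_embs, k=3)
```

With these toy vectors only the second video ranks its own track first, so `r1` is one third, while `r3` is 1.0 because every track falls within the top three.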

Tags:

Cross-modal retrieval

Music retrieval

Video retrieval

Human motion
