VIDEO-MUSIC RETRIEVAL WITH FINE-GRAINED CROSS-MODAL ALIGNMENT
Yuki Era, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
This paper presents a novel video-music retrieval method for videos containing humans. Our method constructs a cross-modal common embedding space from motion-based video features and music features, and aligns that space using beat and genre information. Beat and genre are distinctive properties of music, and music correlates more strongly with human motion features than with features extracted from entire videos. Our method therefore performs cross-modal retrieval focused on the motion-music relationship, improving retrieval performance through motion-based video features and music property-based alignment of the embedding space. To the best of our knowledge, this is the first work to relate human motion to music for video-music retrieval. In the experiments, we compare our method with state-of-the-art video-music and human motion-music retrieval methods and verify that it achieves more than twice the Recall@1 of the conventional methods.
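To make the notion of a cross-modal common embedding space concrete, the following is a minimal sketch, not the authors' implementation: two encoders project motion features and music features into a shared space and are trained with a symmetric contrastive (InfoNCE) objective so that matched pairs are close. All module names, feature dimensions, and the loss choice are illustrative assumptions; the paper's beat- and genre-based alignment is not modeled here.

```python
# Hypothetical sketch of a dual-encoder cross-modal embedding space
# (illustrative only; dimensions and architecture are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Projects modality-specific features into the shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, x):
        # Unit-normalized embeddings so similarity is a dot product.
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(z_motion, z_music, temperature=0.07):
    """Symmetric InfoNCE: matched motion-music pairs on the diagonal."""
    logits = z_motion @ z_music.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: 8 paired clips, 128-D motion features, 64-D music features.
motion_enc, music_enc = Encoder(128), Encoder(64)
z_v = motion_enc(torch.randn(8, 128))
z_m = music_enc(torch.randn(8, 64))
contrastive_loss(z_v, z_m).backward()
```

At retrieval time, a query video's motion embedding would simply be compared against the precomputed music embeddings by cosine similarity, with the top-ranked tracks returned; Recall@1 measures how often the ground-truth track ranks first.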