SSVMR: SALIENCY-BASED SELF-TRAINING FOR VIDEO-MUSIC RETRIEVAL
Xuxin Cheng (Peking University); Zhihong Zhu (Peking University); Hongxiang Li (Peking University); Yaowei Li (Peking University); Yuexian Zou (Peking University)
With the rise of short videos, the demand for selecting appropriate background music (BGM) for a video has increased significantly, and the video-music retrieval (VMR) task has gradually drawn much attention from the research community. As in other cross-modal learning tasks, existing VMR approaches usually attempt to measure the similarity between the video and the music in a shared feature space. However, they (1) neglect the inevitable label noise and (2) neglect to enhance the ability to capture critical video clips. In this paper, we propose a novel saliency-based self-training framework, termed SSVMR. Specifically, we first make full use of the information contained in the training dataset by applying a semi-supervised method to suppress the adverse impact of the label noise problem, where a self-training approach is adopted. In addition, we propose to capture the saliency of the video by mixing two videos at the span level while preserving the locality of the two original videos. Inspired by back translation in NLP, we also conduct back retrieval to obtain more training data. Experimental results on the MVD dataset show that our SSVMR achieves state-of-the-art performance by a large margin.
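To make the span-level mixing idea concrete, the sketch below shows one plausible implementation of mixing two clip-level video feature sequences: a contiguous span of clips from one video replaces the corresponding span in the other, so each source keeps the locality of its remaining clips. This is an illustrative sketch only; the function name span_mix, the Beta-sampled span length, and the equal-length assumption are our own choices and need not match the paper's exact formulation.

```python
import torch

def span_mix(video_a: torch.Tensor, video_b: torch.Tensor, alpha: float = 0.5):
    """Span-level mixing of two clip feature sequences (hypothetical sketch).

    A contiguous span of clips from `video_b` overwrites the matching span in
    `video_a`, preserving the locality of the untouched clips. Returns the
    mixed sequence and the fraction of clips that still come from `video_a`.

    Shapes: (num_clips, feat_dim); both videos are assumed to have the same
    number of clips here for simplicity.
    """
    num_clips = video_a.size(0)
    # Span length drawn from a Beta distribution, as in mixup-style methods.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    span_len = int(round((1.0 - lam) * num_clips))
    start = torch.randint(0, num_clips - span_len + 1, (1,)).item()

    mixed = video_a.clone()
    mixed[start:start + span_len] = video_b[start:start + span_len]
    lam_effective = 1.0 - span_len / num_clips  # share of clips from video_a
    return mixed, lam_effective

# Example: mix two 32-clip videos with 512-dimensional clip features.
v_a, v_b = torch.randn(32, 512), torch.randn(32, 512)
mixed, lam = span_mix(v_a, v_b)
```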