Exploring Visual-Audio Composition Alignment Network For Quality Fashion Retrieval In Video
Yanhao Zhang, Jianmin Wu, Xiong Xiong, Dangwei Li, Chenwei Xie, Yun Zheng, Pan Pan, Yinghui Xu
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 00:06:12
Fashion retrieval in video suffers from imperfect visual representation and low-quality search results in e-commerce settings. Previous works generally focus on searching for identical images from a visual perspective only, and do not leverage multi-modal information to retrieve high-quality commodities. In this cross-domain problem, instructional or exhibiting audio reveals rich semantic information that facilitates the video-to-shop task. In this paper, we present a novel Visual-Audio Composition Alignment Network (VACANet) for quality fashion retrieval in video. First, we introduce a visual-audio composition module in VACANet that aims to distinguish attentive and residual entities by learning semantic embeddings from both visual and audio streams. Second, we design a quality alignment training scheme that combines quality-aware triplet mining with a domain alignment constraint for video-to-image adaptation. Finally, extensive experiments on challenging video datasets demonstrate the scalable effectiveness of our model for quality fashion retrieval.
Chairs:
Sicheng Zhao