LEVERAGING VISUAL PROMPTS TO GUIDE LANGUAGE MODELING FOR REFERRING VIDEO OBJECT SEGMENTATION
Qiqi Gao, Wanjun Zhong, Jie Li, Tiejun Zhao
Referring Video Object Segmentation (R-VOS) aims to segment object masks in a target video given a language query describing the object. It is a challenging task that requires modeling both the semantics of a natural language query and its correspondence to the target video. Previous works directly use visual-agnostic language features from uni-modal language models and interact with visual features only in late decoding stages. We propose to encode visually enriched language features by using visual prompts as guidance in the early encoding stage. The proposed visual prompt is constructed by modulating visual features of key frames with their alignment scores to the text inputs, where the alignment scores are computed with a pre-trained visual-language contrastive model. We concatenate the visual prompts with the text inputs to encode visually enriched language features, which serve as queries for target object segmentation in a Transformer-based decoder. Our method outperforms the previous state-of-the-art method (+2.3) on the Refer-YouTube-VOS benchmark.
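The prompt construction described above can be sketched in a few lines. The following is a minimal, simplified illustration (not the paper's implementation): it assumes key-frame visual features and a text feature already projected into a shared embedding space by a contrastive model such as CLIP, scores frames by cosine similarity, modulates the frame features with softmax-normalized alignment scores, and prepends the result to the text token embeddings. All function names and the exact modulation form are hypothetical.

```python
import numpy as np

def build_visual_prompt(frame_feats: np.ndarray, text_feat: np.ndarray) -> np.ndarray:
    """Modulate key-frame features with text-alignment scores.

    frame_feats: (T, D) features of T key frames
    text_feat:   (D,)   feature of the language query
    Returns a (T, D) visual prompt.
    """
    # Cosine similarity between each key frame and the query,
    # as a visual-language contrastive model would score them.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = f @ t                                   # (T,) alignment scores
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over frames
    # Modulate: weight each frame's features by its alignment score.
    return weights[:, None] * frame_feats            # (T, D)

def visual_enriched_input(prompt: np.ndarray, text_tokens: np.ndarray) -> np.ndarray:
    """Concatenate the visual prompt with text token embeddings
    along the sequence dimension before language encoding."""
    return np.concatenate([prompt, text_tokens], axis=0)  # (T + L, D)
```

The concatenated sequence would then be fed to the language encoder so that the resulting language features are conditioned on the video from the start, rather than meeting visual features only in the decoder.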