VLKP: VIDEO INSTANCE SEGMENTATION WITH VISUAL-LINGUISTIC KNOWLEDGE
Ruixiang Chen (Zhejiang University of Technology); Sheng Liu (Zhejiang University of Technology); Junhao Chen (Zhejiang University of Technology); Bingnan Guo (Zhejiang University of Technology); Feng Zhang (Zhejiang University of Technology)
Most video instance segmentation (VIS) models focus only on visual knowledge and ignore intrinsic linguistic knowledge. Based on the observation that incorporating linguistic knowledge can significantly improve a model's contextual understanding of video, in this paper we present a Video Instance Segmentation approach with Visual-Linguistic Knowledge Prompts (VLKP), a novel paradigm for offline video instance segmentation. Specifically, we propose a visual-linguistic knowledge prompt training strategy that fuses linguistic features with visual features to obtain visual-linguistic features, which the model processes in place of traditional visual features. In addition, we design a new temporal shift encoder that conveys information between frames and enhances the temporal sensitivity of the model. On two widely adopted VIS benchmarks, i.e., YouTube-VIS-2019 and YouTube-VIS-2021, VLKP with ResNet-50 obtains state-of-the-art results, e.g., 47.7 AP on YouTube-VIS-2019 and 42.0 AP on YouTube-VIS-2021. Code is available at https://github.com/ruixiangC/VLKP.
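The abstract describes the temporal shift encoder only at a high level. One plausible reading, borrowed from the well-known temporal shift module (TSM) idea rather than from this paper's actual implementation, is that a fraction of feature channels is shifted between adjacent frames so that each frame's features mix with those of its neighbours at zero extra parameter cost. A minimal NumPy sketch under that assumption (the function name and the `shift_div` parameter are hypothetical, not from the paper):

```python
import numpy as np

def temporal_shift(x: np.ndarray, shift_div: int = 8) -> np.ndarray:
    """TSM-style temporal shift over per-frame feature maps.

    x: array of shape (T, C, H, W) holding features of T frames.
    One chunk of C // shift_div channels is shifted backward in time,
    another chunk forward; the remaining channels are left in place.
    """
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift chunk 1 backward
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift chunk 2 forward
    out[:, 2 * fold:] = x[:, 2 * fold:]              # keep the rest unchanged
    return out
```

Because the shift is a pure re-indexing, such an encoder exchanges information between neighbouring frames without any temporal convolution; boundary frames receive zeros for the shifted-in channels.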