MOTION-AWARE VIDEO PARAGRAPH CAPTIONING VIA EXPLORING OBJECT-CENTERED INTERNAL KNOWLEDGE
Yimin Hu (Fudan University); Guorui Yu (Fudan University); Yuejie Zhang (Fudan University); Rui Feng (Fudan University); Tao Zhang (Shanghai University of Finance and Economics); Xuequan Lu (Deakin University); Shang Gao (Deakin University)
The video paragraph captioning task aims to generate a fine-grained, coherent, and relevant paragraph for a video. Unlike images, where objects are static, the temporal states of objects change over the course of a video, and this dynamic information contributes to understanding the overall video content. Existing works rarely focus on modeling the changing states of objects in videos, so the activities occurring in videos are often poorly or wrongly depicted in the generated paragraphs. To address this problem, we propose a novel Object State Tracking Network that captures the temporal state changes of objects. However, because consecutive frames of a video are highly similar, video information is redundant and noisy. We therefore further propose a semantic alignment mechanism that enables sentence information to refine the visual information. Extensive experiments on ActivityNet Captions demonstrate the effectiveness of our method.