MOTION-AWARE VIDEO PARAGRAPH CAPTIONING VIA EXPLORING OBJECT-CENTERED INTERNAL KNOWLEDGE
Yimin Hu (Fudan University); Guorui Yu (Fudan University); Yuejie Zhang (Fudan University); Rui Feng (Fudan University); Tao Zhang (Shanghai University of Finance and Economics); Xuequan Lu (Deakin University); Shang Gao (Deakin University)
The video paragraph captioning task aims to generate a fine-grained, coherent, and relevant paragraph for a video. Unlike images, where objects are static, the temporal states of objects change over the course of a video, and this dynamic information contributes to understanding the overall video content. Existing works rarely focus on modeling the changing states of objects in videos, so the activities occurring in videos are often poorly or wrongly depicted in the generated paragraphs. To address this problem, we propose a novel Object State Tracking Network that captures the temporal state changes of objects. However, because consecutive frames of a video are highly similar, video information is redundant and noisy. We therefore further propose a semantic alignment mechanism that enables sentence information to refine the visual information. Extensive experiments on ActivityNet Captions demonstrate the effectiveness of our method.