Video Captioning with Temporal and Region Graph Convolution Network

Xinlong Xiao, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao, Weiguo Fan

SPS

Members: Free
IEEE Members: $11.00
Non-members: $15.00

Length: 07:35

09 Jul 2020

Video captioning aims to generate a natural language description of a given video clip, which carries both spatial and temporal information. To better exploit this spatial-temporal information, we propose a novel video captioning framework with a Temporal Graph Network (TGN) and a Region Graph Network (RGN). TGN focuses on the sequential information of frames, which most existing methods ignore. RGN is designed to explore the relationships among salient objects. Unlike previous work, we introduce a Graph Convolution Network (GCN) to encode frames together with their sequential information, and we build a region graph to exploit object information. We also adopt a stacked GRU decoder with a coarse-to-fine structure for caption generation. Experimental results on two benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of our model.
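To make the idea of encoding frames with a graph convolution more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the temporal graph is built, so the chain-shaped adjacency (each frame linked to itself and its immediate neighbours), the symmetric normalisation, the single ReLU layer, and all names (`temporal_chain_adjacency`, `normalize_adjacency`, `gcn_layer`) and dimensions are assumptions chosen purely for illustration.

```python
import math
import random


def temporal_chain_adjacency(num_frames):
    """Assumed temporal graph: self-loops plus edges between consecutive frames."""
    adj = [[0.0] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        adj[i][i] = 1.0
        if i + 1 < num_frames:
            adj[i][i + 1] = 1.0
            adj[i + 1][i] = 1.0
    return adj


def normalize_adjacency(adj):
    """Symmetric normalisation D^{-1/2} A D^{-1/2}, a common GCN choice."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(features, adj_norm, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    propagated = matmul(adj_norm, features)   # mix each frame with its neighbours
    projected = matmul(propagated, weight)    # linear projection of mixed features
    return [[max(v, 0.0) for v in row] for row in projected]


# Hypothetical sizes: 8 frames, 16-d frame features, 4-d encoded output.
random.seed(0)
num_frames, feat_dim, hidden_dim = 8, 16, 4
frame_features = [[random.gauss(0, 1) for _ in range(feat_dim)]
                  for _ in range(num_frames)]
weight = [[random.gauss(0, 0.1) for _ in range(hidden_dim)]
          for _ in range(feat_dim)]

adj_norm = normalize_adjacency(temporal_chain_adjacency(num_frames))
encoded = gcn_layer(frame_features, adj_norm, weight)
```

Each row of `encoded` is a frame representation that has been mixed with its temporal neighbours. A region graph over salient objects could reuse `gcn_layer` with a different adjacency, although how the paper defines that graph is not stated in the abstract.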

Tags:

icme 2020

sps conference

Value-Added Bundle(s) Including this Product

ICME 2020 Virtual Conference - Presentation Videos Product Bundle
