Video Captioning with Temporal and Region Graph Convolution Network
Xinlong Xiao, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao, Weiguo Fan
Video captioning aims to generate a natural language description for a given video clip, which carries not only spatial but also temporal information. To better exploit the spatial-temporal information in videos, we propose a novel video captioning framework with a Temporal Graph Network (TGN) and a Region Graph Network (RGN). TGN focuses on utilizing the sequential information of frames that most existing methods ignore, while RGN is designed to explore the relationships among salient objects. Unlike previous work, we introduce a Graph Convolution Network (GCN) to encode frames together with their sequential information, and we build a region graph to exploit object information. We also adopt a stacked GRU decoder with a coarse-to-fine structure for caption generation. Experimental results on two benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of our model.
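The paper's code is not included here, but the core TGN operation named in the abstract, graph convolution over frames linked by their temporal order, can be sketched as below. Everything in the sketch (the `TemporalGCN` module, the chain-shaped adjacency, the feature dimensions) is an illustrative assumption rather than the authors' implementation; the same layer would apply to RGN's region graph if the adjacency were built over detected object regions instead of frames.

```python
# A minimal sketch (not the authors' code) of one GCN layer over a temporal
# frame graph, assuming frames are nodes and consecutive frames share an edge.
import torch
import torch.nn as nn


def chain_adjacency(num_frames: int) -> torch.Tensor:
    """Self-looped adjacency linking each frame to its temporal neighbors."""
    a = torch.eye(num_frames)
    idx = torch.arange(num_frames - 1)
    a[idx, idx + 1] = 1.0
    a[idx + 1, idx] = 1.0
    # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)


class TemporalGCN(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, frame_feats: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, in_dim); a_hat: (T, T) normalized adjacency
        return torch.relu(a_hat @ self.proj(frame_feats))


# Usage: encode 26 frame features of dimension 2048 into 512-d node states.
feats = torch.randn(26, 2048)
gcn = TemporalGCN(2048, 512)
encoded = gcn(feats, chain_adjacency(26))  # (26, 512)
```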
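Likewise, a minimal sketch of a stacked GRU decoder with a coarse-to-fine structure is given below, assuming a drafting (coarse) GRU cell whose state is refined by a second (fine) cell before word prediction. The class name, argument names, and the pooled `video_ctx` conditioning are hypothetical placeholders, not the paper's API.

```python
# Hedged sketch of a two-stage "coarse-to-fine" GRU decoder (teacher forcing).
import torch
import torch.nn as nn


class CoarseToFineDecoder(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int, ctx_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.coarse = nn.GRUCell(embed_dim + ctx_dim, hidden_dim)  # draft pass
        self.fine = nn.GRUCell(hidden_dim + ctx_dim, hidden_dim)   # refine pass
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, video_ctx: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L) ground-truth word ids; video_ctx: (B, ctx_dim)
        # pooled encoder features (here, the assumed TGN/RGN output).
        b, length = tokens.shape
        h_c = video_ctx.new_zeros(b, self.out.in_features)
        h_f = video_ctx.new_zeros(b, self.out.in_features)
        logits = []
        for t in range(length):
            w = self.embed(tokens[:, t])
            h_c = self.coarse(torch.cat([w, video_ctx], dim=1), h_c)  # coarse state
            h_f = self.fine(torch.cat([h_c, video_ctx], dim=1), h_f)  # refined state
            logits.append(self.out(h_f))
        return torch.stack(logits, dim=1)  # (B, L, vocab_size)


# Usage: decode 10-step captions for a batch of 4 videos.
dec = CoarseToFineDecoder(vocab_size=9000, embed_dim=300, hidden_dim=512, ctx_dim=512)
out = dec(torch.randint(0, 9000, (4, 10)), torch.randn(4, 512))  # (4, 10, 9000)
```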