VIDEO CAPTIONING VIA RELATION-AWARE GRAPH LEARNING
Yi Zheng (Fudan University); Heming Jing (Fudan University); Qiujie Xie (School of Computer Science, Fudan University); Yuejie Zhang (Fudan University); Rui Feng (Fudan University); Tao Zhang (Shanghai University of Finance and Economics); Shang Gao (Deakin University)
Recent neural models for video captioning typically adopt an encoder-decoder framework. However, most approaches either neglect the spatial and temporal interactions between objects in a video or model these interactions only implicitly, leading to suboptimal performance. In this paper, we propose a novel relation-aware graph learning framework that explicitly models both spatial and temporal relations between objects. In particular, a relation-aware graph is designed to depict the spatial relations between different objects in a scene. In parallel, a temporal graph network performs relational reasoning over the same objects across adjacent frames. Features of both types of relations are learned and fused before being passed to the language decoder. Experiments on two benchmark datasets demonstrate the effectiveness of our framework, which achieves state-of-the-art CIDEr scores on MSVD and MSR-VTT.
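The following is a minimal sketch, not the authors' implementation, of the two-branch idea described in the abstract: a spatial graph over object features within each frame, a temporal graph linking the same object slots across adjacent frames, and a fusion step producing per-frame features for a language decoder. The module names, feature dimensions, and the dot-product similarity used to build the soft adjacency are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationGraphLayer(nn.Module):
    """One graph layer: builds a soft adjacency from node similarity and
    aggregates neighbour features (a common relational-reasoning pattern)."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, nodes):                  # nodes: (batch, n_nodes, dim)
        # Relation weights between every pair of nodes (soft adjacency matrix).
        adj = torch.softmax(
            self.query(nodes) @ self.key(nodes).transpose(-1, -2)
            / nodes.size(-1) ** 0.5, dim=-1)
        # Aggregate neighbour information with a residual connection.
        return F.relu(nodes + adj @ self.value(nodes))


class RelationAwareEncoder(nn.Module):
    """Spatial branch: relations among objects inside each frame.
    Temporal branch: relations among the same object slot across frames."""

    def __init__(self, dim=512):
        super().__init__()
        self.spatial = RelationGraphLayer(dim)
        self.temporal = RelationGraphLayer(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, objs):                   # objs: (batch, frames, objects, dim)
        b, t, n, d = objs.shape
        # Spatial relations: graph over the objects of each frame.
        spatial = self.spatial(objs.reshape(b * t, n, d)).reshape(b, t, n, d)
        # Temporal relations: graph over each object slot across frames.
        temporal = self.temporal(
            objs.permute(0, 2, 1, 3).reshape(b * n, t, d)
        ).reshape(b, n, t, d).permute(0, 2, 1, 3)
        # Fuse both relation types and pool objects into per-frame features.
        fused = self.fuse(torch.cat([spatial, temporal], dim=-1))
        return fused.mean(dim=2)               # (batch, frames, dim) for the decoder


if __name__ == "__main__":
    feats = torch.randn(2, 8, 5, 512)           # 2 clips, 8 frames, 5 objects each
    print(RelationAwareEncoder()(feats).shape)  # torch.Size([2, 8, 512])
```

The pooled per-frame outputs would then feed a caption decoder (e.g. an LSTM or Transformer), which is outside the scope of this sketch.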