MODELING LOCAL AND GLOBAL CONTEXTS FOR IMAGE CAPTIONING
Peng Yao, Jiangyun Li, Longteng Guo, Jing Liu
Image captioning aims to first observe an image, most notably the involved objects, which are highly context-dependent, and then depict it with a natural description. However, most current models use only isolated object vectors as image representations, ignoring the contexts among them. In this paper, we introduce a Local-Global Context (LGC) network, endowing the independent object features with short-range perception (local contexts) and long-range dependence (global contexts). The LGC network can be viewed as a feature refiner that helps the caption decoder reason about novel objects and words. The local contexts are modeled with a 1-D group convolution over adjacent objects, strengthening local connections. Further, a self-attention mechanism models the global contexts by correlating all the local contexts. Extensive experiments on the MSCOCO dataset demonstrate that the LGC network can be easily plugged into neural captioning models and significantly improves their performance.
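The two refinement stages described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the kernel size, the depthwise (one-group-per-channel) form of the 1-D convolution, the same-padding scheme, and the use of unprojected features in the attention (no learned Q/K/V matrices) are all simplifying assumptions made for illustration.

```python
import numpy as np

def local_context(x, w):
    """Short-range perception: depthwise 1-D convolution over adjacent objects.

    x: (N, d) array of N object feature vectors.
    w: (k, d) per-channel kernel applied along the object axis (assumed kernel
       size k with same-padding; the paper's exact group configuration may differ).
    """
    N, d = x.shape
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))          # zero-pad along the object axis
    out = np.zeros_like(x)
    for i in range(N):
        # weighted sum of each object's k-neighborhood, channel-wise
        out[i] = np.sum(xp[i:i + k] * w, axis=0)
    return out

def self_attention(x):
    """Long-range dependence: scaled dot-product self-attention over all rows."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                  # pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # row-wise softmax
    return attn @ x                                # context-weighted features

def lgc_refine(x, w):
    """LGC-style refinement: local contexts first, then global correlation."""
    return self_attention(local_context(x, w))
```

The refined output keeps the shape of the input object features, so a block like this can sit between the object detector and the caption decoder without changing either interface, which is the "plug-in" property the abstract claims.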