Exploring Dual Stream Global Information for Image Captioning
Tiantao Xian, Zhixin Li, Tianyu Chen, Huifang Ma
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 00:05:18
In recent years, image captioning methods based on the encoder-decoder framework have achieved promising results, but most of them do not exploit global information. In general, visual global information can provide more fine-grained details for recognizing small objects, while textual global information provides a coarse understanding of the visual scene. In this paper, we propose the Dual Global Enhanced Transformer (DGET) to explicitly utilize both visual and textual global information. In the encoding stage, we combine two complementary visual features with different properties to obtain a global enhanced visual representation through a novel Global Enhanced Encoder (GEE). In the decoding stage, we propose a Global Enhanced Decoder (GED) that explicitly utilizes the textual global information. To validate our model, we conduct extensive experiments on the COCO image captioning dataset and achieve superior performance over many state-of-the-art methods.
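The encoding idea described above, combining two visual feature streams into a globally enhanced representation, can be illustrated with a small sketch. This is not the GEE itself: the abstract does not say which two features are used or how they are fused, so the stream names, the mean-pooled global vector, the single-head cross-attention, and the additive fusion below are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(feats):
    # Average a list of equal-length vectors into one "global" vector (assumed pooling).
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]

def cross_attend(query, keys):
    # Single-head scaled dot-product attention; the keys also serve as values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(d)]

def global_enhanced_encode(stream_a, stream_b):
    # Hypothetical fusion: each stream_a vector attends over stream_b, then the
    # attended context and a pooled global vector are added to it.
    global_vec = mean_pool(stream_b)
    enhanced = []
    for feat in stream_a:
        context = cross_attend(feat, stream_b)
        enhanced.append([f + c + g for f, c, g in zip(feat, context, global_vec)])
    return enhanced, global_vec

# Toy usage: two 2-dimensional vectors per stream.
stream_a = [[1.0, 0.0], [0.0, 1.0]]
stream_b = [[1.0, 1.0], [1.0, 1.0]]
enhanced, global_vec = global_enhanced_encode(stream_a, stream_b)
```

In a real model the two streams would come from pretrained visual backbones, and the fusion would use learned projections and multi-head attention rather than the fixed arithmetic shown here.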