
Spatial Cross-Attention for Transformer-based Image Captioning

Khoa Anh Ngo, Kyuhong Shim, Byonghyo Shim (Seoul National University)

07 Jun 2023

Transformer-based networks have achieved great success in image captioning thanks to the attention mechanism, which finds the image locations relevant to each word. However, the current cross-attention process, which aligns words to image patches, does not consider the spatial relationships between patches. This lack of spatial information can yield captions that misstate the positional relationships among objects. In this paper, we introduce a novel cross-attention architecture that exploits spatial information derived from the coordinate differences between relevant image patches. As a result, our cross-attention process dynamically considers both the related contents and their spatial relationships during caption generation. In addition, we present an efficient implementation of relative spatial attention based on convolutional operations. Experimental results show that the proposed spatial cross-attention helps captions correctly describe the spatial relationships of objects, yielding a 0.7 CIDEr improvement on the MS-COCO dataset over the previous state-of-the-art.
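The paper itself gives the exact formulation; as a rough illustration of the idea, the sketch below (PyTorch) adds a convolution-derived relative-position term to word-to-patch cross-attention logits. All module names, tensor shapes, and the placement of the convolution are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch: convolution-based relative spatial attention in
# word-to-patch cross-attention. Shapes and design are illustrative
# assumptions, not the paper's exact architecture.
import math
import torch
import torch.nn as nn

class SpatialCrossAttention(nn.Module):
    def __init__(self, dim, num_heads, grid_size, kernel_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.grid_size = grid_size  # patches per side, e.g. 7 for a 7x7 grid
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # One small depthwise kernel per head; a conv kernel is indexed by
        # coordinate offsets, so it encodes relative patch positions.
        self.spatial_conv = nn.Conv2d(
            num_heads, num_heads, kernel_size,
            padding=kernel_size // 2, groups=num_heads, bias=False)

    def forward(self, words, patches):
        # words:   (B, T, dim)   decoder word features (queries)
        # patches: (B, G*G, dim) encoder patch features (keys/values)
        B, T, _ = words.shape
        G = self.grid_size
        q = self.q_proj(words).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(patches).view(B, G * G, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(patches).view(B, G * G, self.num_heads, self.head_dim).transpose(1, 2)

        # Content term: standard scaled dot-product scores, (B, h, T, G*G).
        logits = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)

        # Spatial term via convolution: fold each word's scores over patches
        # back into the 2D grid and let each head's depthwise kernel add a
        # bias that depends only on relative patch coordinates.
        spatial = logits.permute(0, 2, 1, 3).reshape(B * T, self.num_heads, G, G)
        spatial = self.spatial_conv(spatial)
        spatial = spatial.reshape(B, T, self.num_heads, G * G).permute(0, 2, 1, 3)

        attn = (logits + spatial).softmax(dim=-1)  # content + spatial terms
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)

# Usage: 12 words attending over a 7x7 patch grid.
layer = SpatialCrossAttention(dim=512, num_heads=8, grid_size=7)
caps = layer(torch.randn(2, 12, 512), torch.randn(2, 49, 512))  # (2, 12, 512)
```

Because the convolution kernel is shared across grid positions and indexed by coordinate offsets, it realizes a patch-to-patch relative spatial bias at the cost of one small depthwise convolution per head, which is one plausible reading of the paper's convolution-based efficiency claim.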
