Exploring Entity-Level Spatial Relationships For Image-Text Matching
Yaxian Xia, Lun Huang, Wenmin Wang, Xiao-Yong Wei, Jie Chen
Exploring entity-level (i.e., objects in an image, words in a text) spatial relationships contributes to understanding multimedia content precisely. Ignoring spatial information, as previous works do, can lead to misunderstanding image content. For instance, the sentences "Boats are on the water" and "Boats are under the water" describe the same objects but correspond to different scenes. To this end, we utilize the relative positions of objects to capture entity-level spatial relationships for image-text matching. Specifically, we fuse the semantic and spatial relationships of image objects in a visual intra-modal relation module. This module proves effective for understanding image content and improving object representation learning, and it helps capture the entity-level latent correspondence of image-text pairs. The query text then serves as textual context to refine the interpretable alignments of image-text pairs in the inter-modal relation module. Our proposed method achieves state-of-the-art results on the MS-COCO and Flickr30K datasets.
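As a concrete illustration of the visual intra-modal relation module, the following is a minimal PyTorch sketch of how pairwise relative-position features of object bounding boxes could be fused with semantic attention. The log-offset geometry encoding, the additive fusion of the semantic and spatial terms, the feature dimension, and the names (relative_position_features, IntraModalRelation) are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_position_features(boxes):
    """Pairwise relative-position features for N boxes in (cx, cy, w, h) format.
    Returns an (N, N, 4) tensor of log-scaled offsets and size ratios, a common
    geometry encoding (assumed here; the paper may use a variant)."""
    cx, cy, w, h = boxes.unbind(-1)                    # each of shape (N,)
    eps = 1e-3
    dx = (cx[None, :] - cx[:, None]) / w[:, None]      # normalized horizontal offset
    dy = (cy[None, :] - cy[:, None]) / h[:, None]      # normalized vertical offset
    dw = w[None, :] / w[:, None]                       # width ratio
    dh = h[None, :] / h[:, None]                       # height ratio
    return torch.stack([
        torch.log(dx.abs().clamp(min=eps)),
        torch.log(dy.abs().clamp(min=eps)),
        torch.log(dw.clamp(min=eps)),
        torch.log(dh.clamp(min=eps)),
    ], dim=-1)                                         # (N, N, 4)

class IntraModalRelation(nn.Module):
    """Refines region features by fusing semantic affinity with a spatial bias."""
    def __init__(self, dim=1024):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.spatial = nn.Linear(4, 1)                 # geometry -> scalar bias

    def forward(self, feats, boxes):
        # feats: (N, dim) region features; boxes: (N, 4) in (cx, cy, w, h)
        sem = self.q(feats) @ self.k(feats).t() / feats.size(-1) ** 0.5    # (N, N)
        geo = self.spatial(relative_position_features(boxes)).squeeze(-1)  # (N, N)
        attn = F.softmax(sem + F.relu(geo), dim=-1)    # semantic + spatial fusion
        return feats + attn @ self.v(feats)            # relation-enhanced features

# Toy usage with 36 Faster R-CNN-style region features:
regions = torch.randn(36, 1024)
boxes = torch.rand(36, 4) + 0.1                        # keep widths/heights positive
refined = IntraModalRelation()(regions, boxes)         # (36, 1024)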
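Similarly, a hedged sketch of the inter-modal step, where the text query provides context over the relation-enhanced regions: this follows a generic text-to-region cross-attention pattern, and the temperature value, pooling, and function name are assumptions; the paper's inter-modal relation module may differ in detail.

import torch
import torch.nn.functional as F

def intermodal_alignment(words, regions, temperature=9.0):
    """Text-guided attention over regions: each word attends to regions, and
    the image-sentence score pools per-word alignment similarities."""
    # words: (T, dim) word features; regions: (N, dim) relation-enhanced features
    w = F.normalize(words, dim=-1)
    r = F.normalize(regions, dim=-1)
    attn = F.softmax(temperature * (w @ r.t()), dim=-1)   # (T, N) word-to-region
    context = attn @ regions                              # (T, dim) attended visual context
    sims = F.cosine_similarity(words, context, dim=-1)    # (T,) per-word alignment
    return sims.mean()                                    # scalar matching score

score = intermodal_alignment(torch.randn(12, 1024), torch.randn(36, 1024))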