Visual Graph Reasoning Network
Dingbang Li (ECNU); Xin Lin (ECNU); Haibin Cai (East China Normal University); Wenzhou Chen (Zhejiang University)
Visual question answering (VQA) is a fundamental and challenging cross-modal task. It requires a model to fully understand an image's content and reason out the answer based on the question. Existing VQA models understand visual content mainly through bottom-up or grid features, but both types of visual features have drawbacks. The discreteness and independence of bottom-up features prevent models from adequately performing relational reasoning, while partitioning an image into grid features fragments meaningful visual regions, limiting the model's cross-modal alignment ability. We therefore propose a more flexible representation called the Visual Graph. It connects patches according to semantic similarity and spatial relevance, modeling their potential relationships and clustering adjacent homologous patches. Based on the Visual Graph, we design a Visual Graph Reasoning Network for VQA. We evaluate our model on GQA and VQA-v2; the experimental results show that it achieves competitive performance among single models.
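The abstract describes connecting image patches by semantic similarity and spatial relevance. A minimal sketch of how such a patch graph could be built is shown below; the thresholding scheme, parameter names, and the specific combination of cosine similarity with grid distance are assumptions for illustration, not the paper's actual construction.

```python
import numpy as np

def build_visual_graph(features, coords, sim_thresh=0.5, dist_thresh=2.0):
    """Hypothetical patch-graph construction: connect two patches when
    their features are semantically similar (cosine similarity above
    sim_thresh) AND they are spatially close (grid distance below
    dist_thresh). All thresholds here are illustrative assumptions."""
    # Semantic similarity: cosine between L2-normalized patch features
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Spatial relevance: Euclidean distance between patch grid positions
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Edge where both conditions hold; no self-loops
    adj = (sim > sim_thresh) & (dist < dist_thresh)
    np.fill_diagonal(adj, False)
    return adj

# Toy example: 4 patches on a 2x2 grid with random 8-d features
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
adj = build_visual_graph(feats, coords)
print(adj.shape)  # (4, 4) boolean adjacency matrix
```

The resulting adjacency matrix could then feed a graph reasoning module (e.g., message passing over connected patches), which is the role the Visual Graph Reasoning Network plays in the paper.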