OUTSIDE KNOWLEDGE VISUAL QUESTION ANSWERING VERSION 2.0

Benjamin Reichman (Georgia Institute of Technology); Anirudh S Sundar (Georgia Institute of Technology); Christopher G Richardson (Georgia Institute of Technology); Tamara Zubatiy (Georgia Institute of Technology); Prithwijit Chowdhury (Georgia Institute of Technology); Aaryan Shah (Georgia Institute of Technology); Jack Truxal (Georgia Institute of Technology); Micah Grimes (Georgia Institute of Technology); Dristi Shah (Georgia Institute of Technology); Woo Ju Chee (Georgia Institute of Technology); Saif Punjwani (Georgia Institute of Technology); Atishay Jain (Georgia Institute of Technology); Larry Heck (Georgia Institute of Technology)

07 Jun 2023

Visual question answering (VQA) lies at the intersection of language and vision research. It functions as a building block for multimodal conversational AI and serves as a testbed for assessing a model’s capability for open-domain scene understanding. While progress in this area was initially accelerated by the 2015 release of the large and popular "VQA" dataset, new datasets are required to sustain this research momentum. For example, the 2019 Outside Knowledge VQA dataset "OK-VQA" extends VQA with more challenging questions that require complex, factual, and commonsense knowledge. However, our analysis found that 41.4% of the dataset needed to be corrected and 10.6% needed to be removed. This paper describes the analysis, corrections, and removals, and presents a new dataset: OK-VQA Version 2.0. To gain insight into the impact of these changes on OK-VQA research, the paper presents results for state-of-the-art models retrained on the new dataset. The side-by-side comparisons show that one method in particular, Knowledge Augmented Transformer for Vision-and-Language, extends its relative lead over competing methods.

Tags:

Multimodal processing of language