Image Generation May Be All You Need for VQA
Kyung Ho Kim (ActionPower); Junseo Lee (ActionPower); Jihwa Lee (ActionPower)
Visual Question Answering (VQA) has benefited from increasingly sophisticated Pretrained Language Models (PLMs) and computer-vision models. In particular, many studies on the language modality have performed data augmentation through image captioning or question generation grounded in the knowledge of a PLM. However, image generation for VQA has been applied only in a limited way, modifying certain parts of the original image to control quality and uncertainty. In this paper, to address this gap, we propose a method that uses an off-the-shelf diffusion model, pre-trained on diverse tasks and images, to inject prior knowledge into the generated images and to secure diversity without losing generality with respect to the answer. In addition, we design an effective training strategy that accounts for question difficulty to handle the multiple images per QA pair. A VQA model trained with our strategy achieves significant performance improvements on datasets that require factual knowledge, without any knowledge information in the language modality.
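The augmentation step described above could be sketched as follows, assuming the Hugging Face `diffusers` library and Stable Diffusion as the off-the-shelf model; the model checkpoint, the prompt template, and the helper names are illustrative assumptions, not the authors' exact setup:

```python
# Sketch of diffusion-based image augmentation for VQA QA pairs.
# Model choice and prompt template are illustrative assumptions.

def qa_to_prompt(question: str, answer: str) -> str:
    """Turn a QA pair into a text-to-image prompt.

    Hypothetical template: restate the question with its answer so the
    generated image reflects the answer without constraining other content.
    """
    return f"a photo depicting: {question.rstrip('?')} - {answer}"


def generate_images(qa_pairs, images_per_pair=4, device="cuda"):
    """Generate several diverse images per QA pair with an
    off-the-shelf diffusion model (securing diversity per the paper)."""
    from diffusers import StableDiffusionPipeline  # pip install diffusers

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
    ).to(device)

    augmented = []
    for question, answer in qa_pairs:
        prompt = qa_to_prompt(question, answer)
        # num_images_per_prompt samples multiple images per QA pair
        images = pipe(prompt, num_images_per_prompt=images_per_pair).images
        augmented.append((question, answer, images))
    return augmented
```

In this sketch, sampling several images per prompt is what yields the multiple images per QA pair that the difficulty-aware training strategy then has to handle.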