Optimal Noise-Aware Imaging With Switchable Prefilters
Zilai Gong, Masayuki Tanaka, Yusuke Monno, Masatoshi Okutomi
Visual question answering (VQA) aims to answer text-based questions about images. The difficulty of VQA lies in accurately localizing the image region related to the question. In this paper, we introduce the \emph{ques-to-visual (q2v)} feature as an additional input for VQA to tackle this problem. The \emph{q2v} feature is generated according to the semantics of the question and contains visual semantics that help locate the region related to the question. We then use self-attention to model the intra-relationships within each modality and enhance the different features, i.e., the \emph{q2v}, image, and text features. The enhanced features are then fused by spatial guided-attention and multi-scale channel attention modules for answer prediction. Experimental results on the VQA2.0 benchmark dataset show that our method achieves higher performance than competing methods.
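
The sketch below illustrates, under stated assumptions, the kind of pipeline the abstract describes: per-modality self-attention enhancement followed by a guided-attention fusion with a channel-attention gate. All module names, layer choices, and dimensions here are illustrative assumptions, not the paper's actual architecture.

# A minimal PyTorch sketch of the enhancement-and-fusion idea described above.
# Module names, layer sizes, and the channel-gating design are assumptions for
# illustration; the paper's actual architecture may differ.
import torch
import torch.nn as nn


class SelfAttentionBlock(nn.Module):
    """Enhances one modality's features by modeling its intra-relationships."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); residual self-attention over the tokens.
        attended, _ = self.attn(x, x, x)
        return self.norm(x + attended)


class GuidedFusion(nn.Module):
    """Fuses question-derived (q2v) and image features via cross-attention,
    then re-weights feature channels with a simple gating layer."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(),
            nn.Linear(dim // 4, dim), nn.Sigmoid(),
        )

    def forward(self, q2v: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
        # q2v queries attend over image regions (spatial guidance).
        fused, _ = self.cross(q2v, img, img)
        # Channel attention: gate each channel by its pooled response.
        gate = self.channel_gate(fused.mean(dim=1, keepdim=True))
        return fused * gate


if __name__ == "__main__":
    batch, regions, words, dim = 2, 36, 14, 512
    img_feats = torch.randn(batch, regions, dim)  # image region features
    q2v_feats = torch.randn(batch, words, dim)    # question-derived visual cues

    enhance_img = SelfAttentionBlock(dim)
    enhance_q2v = SelfAttentionBlock(dim)
    fuse = GuidedFusion(dim)

    out = fuse(enhance_q2v(q2v_feats), enhance_img(img_feats))
    print(out.shape)  # torch.Size([2, 14, 512])

In this sketch the fused representation would feed a standard answer classifier; the multi-scale aspect mentioned in the abstract is not modeled here and would require pooling the image features at several resolutions before gating.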