  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:06:24
03 Oct 2022

In this paper, we present a \textit{Grad-CAM} aware supervised attention framework for visual question answering (VQA) tasks for post-disaster damage assessment. Visual attention in VQA aims to focus on image regions relevant to the question in order to predict answers. However, conventional attention mechanisms in VQA operate in an unsupervised manner, learning to weight visual content by minimizing only the task-specific loss. This approach fails to provide appropriate visual attention when the visual content is very complex. The content and nature of the UAV images in the \textit{FloodNet-VQA} dataset are very complex, as they depict the hazardous scenes left by \textit{Hurricane Harvey} from a high altitude. To tackle this, we propose a supervised attention mechanism that uses explainable features from \textit{Grad-CAM} to supervise visual attention in the VQA pipeline. The proposed mechanism operates in two stages. In the first stage, we derive visual explanations through \textit{Grad-CAM} by training a baseline attention-based VQA model. In the second stage, we supervise the visual attention for each question by incorporating the \textit{Grad-CAM} explanations obtained in the first stage. Our model improves over state-of-the-art VQA models by a considerable margin on the \textit{FloodNet} dataset.
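The two-stage idea above can be sketched as an auxiliary supervision term added to the task loss: after a baseline VQA model yields Grad-CAM maps (stage one), stage two penalizes the distance between the model's attention map and the Grad-CAM explanation. The snippet below is a minimal illustration, not the paper's exact formulation; the MSE distance, the `lam` weight, and all function names are assumptions for the sketch.

```python
import numpy as np

def normalize_map(m):
    """Shift a saliency/attention map to be non-negative and sum to 1,
    so the two maps are compared as distributions."""
    m = m - m.min()
    s = m.sum()
    return m / s if s > 0 else np.full_like(m, 1.0 / m.size)

def supervised_attention_loss(attn, gradcam, task_loss, lam=0.5):
    """Combine the VQA task loss with an attention-supervision term that
    pulls the attention map `attn` toward the Grad-CAM map `gradcam`.
    `lam` (assumed) balances the two terms. Returns (total, supervision)."""
    a = normalize_map(np.asarray(attn, dtype=float))
    g = normalize_map(np.asarray(gradcam, dtype=float))
    sup = float(np.mean((a - g) ** 2))  # MSE between normalized maps
    return task_loss + lam * sup, sup

# When attention already matches the explanation, the extra term vanishes
total, sup = supervised_attention_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 0.7)
```

In practice the supervision term would be computed per question over the spatial attention grid and backpropagated jointly with the answer-classification loss; a KL divergence between the two normalized maps is an equally plausible choice of distance.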
