MULTI-SEMANTIC ALIGNMENT CO-REASONING NETWORK FOR VIDEO QUESTION ANSWERING
Min Peng, Liangchen Liu, Zhenghao Li, Yu Shi, Xiangdong Zhou
SPS
Video question answering challenges models to understand textual questions of varying complexity and to search for clues in visual content with different hierarchical semantics. In this paper, we propose a novel Multi-Semantic Alignment Co-Reasoning Network (MACN) that performs interactive inference between the question and the video input. MACN comprises two modules: Question-Centric Interaction (QCI) and Contextual Semantic Reasoning (CSR). Specifically, QCI builds a question-centric heterogeneous graph that aligns visual content at different temporal scales with the question, so that visual representations are extracted under a better understanding of the text. CSR uses self-attention to capture the contextual dependencies among visual semantics at different hierarchies and thereby co-reason over answer clues. Experiments on three benchmarks demonstrate that the proposed method outperforms previous state-of-the-art methods.
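To make the self-attention step in CSR concrete, the following is a minimal sketch of generic scaled dot-product self-attention over a short sequence of visual feature vectors. It is not the authors' implementation: the function names, the use of identity query/key/value projections, and the toy feature values are all assumptions made for illustration, and the abstract does not specify how CSR parameterizes its attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    features: list of n vectors, each a list of d floats.
    Returns (contextual_features, attention_weights).
    Assumption: queries, keys, and values are the raw features;
    a real model would apply learned linear projections first.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    contextual, attention = [], []
    for query in features:
        # Similarity of this element to every element in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in features
        ]
        weights = softmax(scores)
        # Weighted sum of all value vectors gives the contextual feature.
        mixed = [
            sum(w * value[j] for w, value in zip(weights, features))
            for j in range(dim)
        ]
        contextual.append(mixed)
        attention.append(weights)
    return contextual, attention


# Hypothetical features for three video clips (toy values).
clip_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attention = self_attention(clip_features)
```

Each row of `attention` is a probability distribution over the clips, so every output vector is a convex combination of the input features. In this sketch that is how one visual element gathers context from the others.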