  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:11:18
08 Jun 2021

In essence, visual question answering (VQA) is a process of embedding and transformation between two modalities, image and text. Two critical problems in this process remain unresolved: how to embed the question and image features effectively, and how to transform those features into an answer prediction. To address these problems, this paper proposes a semantic-associated attention method and a reinforcement stacked learning mechanism. First, guided by associations among high-level semantics, a visual spatial attention (VSA) model and a multi-semantic attention (MSA) model are proposed to extract the low-level image feature and the high-level semantic feature, respectively. Second, we develop a reinforcement stacked learning architecture that splits the transformation process into multiple stages so as to approach the answer gradually. At each stage, a new reinforcement learning (RL) method is introduced that directly criticizes inappropriate answers in order to optimize the model. Extensive experiments on the VQA task show that our method achieves state-of-the-art performance.
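
As a rough illustration of the general idea of question-guided spatial attention mentioned in the abstract, the sketch below weights a few image-region feature vectors by their relevance to a question vector and pools them. It is not the paper's VSA or MSA model: the dot-product scoring, the function names (`softmax`, `dot`, `attend`), and the toy feature values are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(regions, question):
    """Hypothetical soft attention: score each region against the question,
    normalize the scores, and return the weighted sum of region features."""
    scores = [dot(r, question) for r in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return weights, attended


# Toy data (assumed): three image regions with 2-d features, one question vector.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [2.0, 0.0]
weights, attended = attend(regions, question)
print(weights)   # regions 0 and 2 get equal, larger weights than region 1
print(attended)  # pooled 2-d image feature
```

In a real VQA model the region features would come from a convolutional network and the scoring function would be learned; the fixed dot product here only shows how the question steers which regions contribute to the pooled feature.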

Chairs:
Dong Tian

