MutAtt: Visual-textual Mutual Guidance for Referring Expression Comprehension
Shuai Wang, Fan Lyu, Wei Feng, Song Wang
Referring expression comprehension (REC) aims to localize the image region referred to by a natural-language expression. Existing methods focus on building convincing visual and language representations independently, which may significantly isolate the visual and language information. In this paper, we argue that in REC the referring expression and the target region are semantically correlated, and that consistency between vision and language exists at the levels of subject, location, and relationship. On top of this, we propose a novel approach called MutAtt that constructs mutual guidance between vision and language, treating the two modalities equally and thus yielding compact information matching. Specifically, for each of the subject, location, and relationship modules, MutAtt builds two kinds of attention-based mutual guidance strategies. One strategy generates vision-guided language embeddings to match the relevant visual features; the other, reversely, generates language-guided visual features to match the relevant language embeddings. This mutual guidance strategy effectively enforces vision-language consistency across the three modules. Experiments on three popular REC datasets demonstrate that the proposed approach outperforms current state-of-the-art methods.
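The two guidance directions described above can be sketched as a pair of cross-attention passes. The following is a minimal NumPy sketch, not the authors' implementation: the function name `cross_attend`, the feature dimensions, and the use of scaled dot-product attention are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Attend from each query row to the rows of `context`.

    Returns one guided vector per query: a softmax-weighted sum of
    context vectors (illustrative scaled dot-product attention, not
    necessarily the attention form used in MutAtt)."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ context

rng = np.random.default_rng(0)
vision = rng.normal(size=(5, 64))    # e.g. 5 candidate-region features
language = rng.normal(size=(7, 64))  # e.g. 7 word embeddings

# Vision-guided language embedding: each region gathers relevant words.
vg_lang = cross_attend(vision, language)     # shape (5, 64)

# Language-guided visual features: each word gathers relevant regions.
lg_vision = cross_attend(language, vision)   # shape (7, 64)
```

Running both directions yields modality-aligned features of matching shape, which can then be compared (e.g. by a similarity score) to enforce the subject, location, and relationship consistency described in the abstract.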