TEXT-IMAGE DE-CONTEXTUALIZATION DETECTION USING VISION-LANGUAGE MODELS

Mingzhen Huang, Shan Jia, Siwei Lyu, Ming-Ching Chang

12 May 2022

Text-image de-contextualization, which pairs real images with inconsistent text, is an emerging form of misinformation and is drawing increasing attention due to the threat it poses to information authenticity. Because the content in each modality is authentic but semantically mismatched across modalities, detecting de-contextualization is a challenging problem in media forensics. Inspired by recent advances in vision-language models, which learn powerful relationships between images and texts, we apply such models to the de-contextualization detection task. Two popular models, CLIP and VinVL, are evaluated and compared on several news and social media datasets to assess their performance in detecting the image-text inconsistency characteristic of de-contextualization. We also summarize interesting observations and shed light on the use of vision-language models in de-contextualization detection.
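At a high level, using a vision-language model for this task amounts to embedding the image and the text into a shared space and flagging pairs whose similarity falls below some threshold. The sketch below illustrates only that thresholding logic; the toy embedding vectors and the threshold value are hypothetical placeholders (real embeddings would come from, e.g., a CLIP image encoder and text encoder), not values or methods taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_decontextualized(image_emb, text_emb, threshold=0.25):
    """Flag an image-text pair as a likely mismatch when the
    similarity of their embeddings falls below the threshold.
    The threshold here is an arbitrary illustrative value."""
    return cosine_similarity(image_emb, text_emb) < threshold

# Toy stand-in embeddings (real ones would be produced by a
# vision-language model such as CLIP)
img = [0.9, 0.1, 0.0]
matched_text = [0.8, 0.2, 0.1]
mismatched_text = [0.0, 0.1, 0.95]

print(is_decontextualized(img, matched_text))      # → False (consistent pair)
print(is_decontextualized(img, mismatched_text))   # → True (inconsistent pair)
```

In practice, the quality of such a detector hinges on how well the model's joint embedding space captures the semantic relationship between modalities, which is what the paper evaluates across CLIP and VinVL.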