Semantic-Preserving Augmentation For Robust Image-Text Retrieval
Sunwoo Kim (Seoul National University); Kyuhong Shim (Seoul National University); Luong Trung Nguyen (Seoul National University); Byonghyo Shim (Seoul National University)
Image-text retrieval is the task of searching for proper textual descriptions of the visual world, and vice versa. One challenge of this task is its vulnerability to corruptions of the input image or text. Such corruptions are often unobserved during training and substantially degrade the quality of the retrieval model's decisions. In this paper, we propose a novel image-text retrieval technique, referred to as robust visual semantic embedding (RVSE), which consists of novel image-based and text-based augmentation techniques called semantic-preserving augmentation for image (SPAug-I) and text (SPAug-T). Since SPAug-I and SPAug-T change the original data in a way that preserves its semantic information, we can enforce the feature extractors to generate semantic-aware embedding vectors regardless of the corruption, which improves the model's robustness significantly. Through extensive experiments on benchmark datasets, we show that RVSE outperforms conventional retrieval schemes in terms of image-text retrieval performance.
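The abstract does not spell out the exact augmentation operations, but a minimal sketch of what semantic-preserving augmentation could look like in PyTorch may help fix the idea. The specific transforms below (mild photometric/geometric edits for SPAug-I, random word dropout for SPAug-T) are illustrative assumptions, not the paper's actual recipe:

    # Hypothetical sketch of semantic-preserving augmentation; the exact
    # operations used by SPAug-I/SPAug-T are assumptions, not the paper's.
    import random
    import torchvision.transforms as T

    # SPAug-I (assumed): corruptions mild enough to keep the image's
    # semantics, e.g., crops retaining most of the scene, small color
    # jitter, and light blur.
    spaug_image = T.Compose([
        T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # keep >= 80% of the scene
        T.ColorJitter(brightness=0.2, contrast=0.2),  # mild photometric noise
        T.GaussianBlur(kernel_size=3),                # light blur
        T.ToTensor(),
    ])

    # SPAug-T (assumed): word-level edits that leave the caption's meaning
    # intact, illustrated here with random word dropout; synonym replacement
    # would be another semantic-preserving option.
    def spaug_text(caption: str, drop_prob: float = 0.1) -> str:
        words = caption.split()
        kept = [w for w in words if random.random() > drop_prob]
        return " ".join(kept) if kept else caption

Under this reading, the augmented image-caption pair carries the same semantics as the original, so training the feature extractors on both pushes them toward embeddings that are invariant to such corruptions.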