26 Oct 2020

In this paper, we propose a generative architecture for manipulating images and scenes with natural language descriptions. This is a challenging task, as the generative network is expected to carry out the given text instruction without changing the contents of the input image that the instruction does not refer to. Existing methods suffer from two main drawbacks: they can only perform changes that affect a limited region of the image, and they cannot handle complex instructions. The proposed approach is designed to address these limitations. It first uses two networks to extract image and text features, respectively. Rather than simply combining the two modalities during the manipulation process, we compose the image and text features with an improved technique. In addition, the generative network employs similarity learning, which improves text-driven manipulation and enforces only text-relevant changes on the input image. Experiments on the CSS and Fashion Synthesis datasets show that the proposed approach performs remarkably well and outperforms the baseline frameworks in terms of R-precision and FID.
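
The abstract does not spell out the composition technique or the similarity objective, so the following is only a minimal sketch of the two ideas it names: it assumes a TIRG-style gated-residual composition (a known way to combine image and text features beyond simple concatenation) and a standard batch-contrastive similarity loss. Module names, feature dimensions, and the loss choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedComposer(nn.Module):
    """Compose an image feature with a text feature (assumed TIRG-style).

    A sigmoid gate decides how much of the original image feature to
    preserve, while a residual branch injects the text-driven change,
    so instruction-irrelevant content can pass through largely intact.
    """

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.Sigmoid(),
        )
        self.residual = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([img_feat, txt_feat], dim=-1)
        return self.gate(x) * img_feat + self.residual(x)


def similarity_loss(composed: torch.Tensor, target: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    """Batch-contrastive similarity learning (an assumed stand-in).

    Pulls each composed feature toward the feature of its ground-truth
    edited image and pushes it away from the other targets in the batch,
    which encourages only the text-relevant change to be applied.
    """
    composed = F.normalize(composed, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = composed @ target.t() / temperature
    labels = torch.arange(composed.size(0), device=composed.device)
    return F.cross_entropy(logits, labels)


# Toy usage with random stand-ins for the encoder outputs.
composer = GatedComposer(dim=512)
img = torch.randn(8, 512)   # image features from the image encoder
txt = torch.randn(8, 512)   # text features from the text encoder
tgt = torch.randn(8, 512)   # features of the ground-truth edited images
loss = similarity_loss(composer(img, txt), tgt)
```

The gated-residual form is one plausible reading of "an improved technique to compose image and text features": the gate lets the network copy unchanged regions from the input, which aligns with the stated goal of enforcing only text-relevant changes.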
