DATASET-LEVEL DIRECTED IMAGE TRANSLATION FOR CROSS-DOMAIN CROWD COUNTING
Xin Tan, Hiroshi Ishikawa
SPS
Most crowd counting methods rely on large amounts of manually labeled data to train a supervised model. With the availability of synthetic datasets, one way to alleviate the scarcity of large-scale datasets is to use an image-to-image translation method to adapt synthetic data for training. However, previous methods focus on adapting local visual features of the image, which leads to distorted and blurry translation results. In this paper, we propose a novel CLIP-guided image-to-image translation method, based on the observation that synthetic and real images are easily separated in CLIP’s embedding space. We use the difference between the two domains in the CLIP space as a consistent guide to train an image translator. A crowd counting model is then trained on images translated from the synthetic data by this translator. Experiments on real-world crowd counting datasets demonstrate the effectiveness of the proposed method, which enables the crowd counting model to achieve state-of-the-art performance.
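The abstract's core idea, using the difference between the synthetic and real domains in CLIP's embedding space as a fixed guide for the translator, can be sketched as follows. This is a minimal illustration of the geometry only: `embed` is a hypothetical stand-in for a CLIP image encoder (here just a random linear projection followed by normalization), and the function names `domain_direction` and `direction_loss` are our own, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "encoder" weights; a real implementation would use a frozen
# CLIP image encoder instead of this random projection.
W = rng.standard_normal((512, 3 * 32 * 32))

def embed(images):
    """Map flattened images to unit-norm 512-d 'CLIP-like' embeddings."""
    z = images.reshape(len(images), -1) @ W.T
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def domain_direction(synthetic, real):
    """Unit vector from the synthetic-domain mean embedding to the
    real-domain mean embedding: the 'consistent guide' of the abstract."""
    d = embed(real).mean(axis=0) - embed(synthetic).mean(axis=0)
    return d / np.linalg.norm(d)

def direction_loss(src, translated, d):
    """Penalize per-image embedding shifts that deviate from direction d
    (1 - cosine similarity, averaged over the batch)."""
    shift = embed(translated) - embed(src)
    shift = shift / np.linalg.norm(shift, axis=1, keepdims=True)
    return float(np.mean(1.0 - shift @ d))
```

In a training loop, `direction_loss` would be computed between each synthetic input and the translator's output, pushing every translated image's embedding shift to align with the single synthetic-to-real direction rather than matching local visual features.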