Gaussian Kernel-Based Cross Modal Network For Spatio-Temporal Video Grounding
Zeyu Xiong, Daizong Liu, Zhou Pan
Natural image matting is a challenging and significant task in computer vision. Recently, image matting has advanced rapidly with the introduction of deep learning methods. To the best of our knowledge, there is no existing image matting method that uses the Transformer. Compared with CNNs, the Transformer attends more to points of interest and to the relationships among content, which benefits the image matting task. In this paper, we present a novel Transformer-based image matting method with Shifted Window self-Attention. Specifically, our method contains two encoders: an alpha encoder and a context encoder. The former leverages the Transformer with Shifted Window self-Attention to extract features of details, such as hairs, feathers, and porous parts of foreground objects. Shifted Window self-Attention restricts attention to window-sized patches while connecting adjacent patches, which enables the Transformer to handle high-resolution images. The context encoder, which takes rescaled images as input, extracts the overall structural information of foreground objects. We then propose a novel Hierarchical Pyramid Pooling Module (HPPM) that gives the network the flexibility to extract features at various resolutions. Experiments show that our method achieves competitive performance on the Composition-1K dataset.
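To make the Shifted Window self-Attention idea concrete, here is a minimal NumPy sketch of the mechanism the abstract describes: attention is computed only inside fixed-size windows, and a cyclic shift lets adjacent windows exchange information across blocks. This is an illustrative simplification, not the authors' implementation; the function names (`window_partition`, `window_attention`) are hypothetical, and the query/key/value projections are taken as the identity for brevity.

```python
import numpy as np

def window_partition(x, ws):
    # Split an (H, W, C) feature map into non-overlapping ws x ws windows:
    # result has shape (num_windows, ws*ws, C).
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_attention(x, ws, shift=0):
    # Cyclically shift the map so patches near window borders land in the
    # same window next time (the "shifted window" trick, hypothetical names).
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))
    windows = window_partition(x, ws)              # (nW, ws*ws, C)
    q = k = v = windows                            # identity projections (sketch only)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    # Softmax over the keys of each window: attention never crosses windows,
    # so cost grows with window size, not with full image size.
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    out = attn @ v                                 # (nW, ws*ws, C)
    # Undo the partition and the shift to restore the (H, W, C) layout.
    H, W, C = x.shape
    nh, nw = H // ws, W // ws
    out = out.reshape(nh, nw, ws, ws, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)
    if shift:
        out = np.roll(out, (shift, shift), axis=(0, 1))
    return out

x = np.random.rand(8, 8, 4).astype(np.float32)
y = window_attention(x, ws=4, shift=2)
print(y.shape)  # (8, 8, 4)
```

Because each window attends only within itself, the quadratic attention cost is bounded by the window size; the shift on alternating layers is what propagates information between windows, which is why the abstract notes the method can handle high-resolution inputs.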