Exploring vision transformer layer choosing for semantic segmentation
Fangjian Lin (Alibaba Inc.); Yizhe Ma (Xinjiang University); Shengwei Tian (Xinjiang University)
Extensive work has demonstrated the effectiveness of Vision Transformers. To achieve higher performance on dense prediction tasks, plain Vision Transformers typically obtain multi-scale features by selecting a fixed set of layers, or only the last layer. However, this selection is usually made by hand, while different samples often exhibit different characteristics (e.g., edges, structure, texture, detail) at different layers. This calls for a dynamic, adaptive fusion method that filters the features of different layers. In this paper, unlike previous work on encoders and decoders, we design a neck network for adaptive fusion and feature selection, called ViTController. We validate the effectiveness of our method on different datasets and models, surpassing previous state-of-the-art methods. Finally, our method can also be used as a plug-in module and inserted into different networks.
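The abstract does not spell out the fusion mechanism, so the following is only a minimal sketch of the general idea it describes: a neck that learns per-sample weights over the features of every transformer layer instead of hand-picking fixed ones. The class name LayerFusionNeck and the mean-pool-plus-linear controller are assumptions for illustration, not the actual ViTController architecture.

import torch
import torch.nn as nn

class LayerFusionNeck(nn.Module):
    # Hypothetical neck: fuses the token features of all ViT layers
    # with per-sample learned weights rather than fixed layer indices.
    def __init__(self, dim: int):
        super().__init__()
        # A small controller scores each layer from its pooled features.
        self.controller = nn.Linear(dim, 1)

    def forward(self, layer_feats):
        # layer_feats: list of L tensors, each (B, N, C) token features
        x = torch.stack(layer_feats, dim=1)            # (B, L, N, C)
        pooled = x.mean(dim=2)                         # (B, L, C)
        scores = self.controller(pooled)               # (B, L, 1)
        weights = scores.softmax(dim=1).unsqueeze(-1)  # (B, L, 1, 1)
        return (weights * x).sum(dim=1)                # (B, N, C)

# Usage with a hypothetical 12-layer ViT-Base backbone:
feats = [torch.randn(2, 196, 768) for _ in range(12)]
neck = LayerFusionNeck(dim=768)
fused = neck(feats)  # (2, 196, 768), passed on to the segmentation decoder

Because the weights are computed from the input itself, each sample can emphasize different layers, which is what makes such a module usable as a plug-in in front of different decoders.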