ProContEXT: Exploring Progressive Context Transformer for Tracking

Jin-Peng Lan (DAMO Academy, Alibaba Group); Zhi-Qi Cheng (Carnegie Mellon University); Jun-Yan He (DAMO Academy, Alibaba Group); Chenyang Li (DAMO Academy, Alibaba Group); Bin Luo (DAMO Academy, Alibaba Group); Xu Bao (DAMO Academy, Alibaba Group); Wangmeng Xiang (DAMO Academy, Alibaba Group); Yifeng Geng (Alibaba Group); Xuansong Xie (DAMO Academy, Alibaba Group)

07 Jun 2023

Existing Visual Object Tracking (VOT) methods take only the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. To this end, we revamped the tracking framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode the spatial and temporal context, refining and updating multi-scale static and dynamic templates to progressively perform accurate tracking. It explores the complementarity between spatial and temporal contexts, opening a new pathway to multi-context modeling for transformer-based trackers. In addition, ProContEXT revises the token pruning technique to reduce computational complexity. Extensive experiments on popular benchmark datasets such as GOT-10k and TrackingNet demonstrate that the proposed ProContEXT achieves state-of-the-art performance.
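
The sketch below illustrates the general idea of encoding static and dynamic templates jointly with the search region and pruning search tokens for efficiency. It is a minimal, hypothetical PyTorch example, not the authors' implementation: the module names, token shapes, and the similarity-based pruning criterion are assumptions made for illustration only.

```python
# Minimal sketch of multi-template context encoding with token pruning,
# assuming a ViT-style tracker. All names/shapes/criteria are hypothetical
# and are not taken from the ProContEXT paper.
import torch
import torch.nn as nn


class ContextEncoder(nn.Module):
    def __init__(self, dim=256, num_heads=8, depth=4, keep_ratio=0.7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.keep_ratio = keep_ratio  # fraction of search tokens kept after pruning

    def prune(self, search_tokens, template_tokens):
        # Score each search token by its mean similarity to all template tokens
        # and keep only the highest-scoring fraction (hypothetical criterion).
        sim = torch.einsum("bnd,bmd->bnm", search_tokens, template_tokens).mean(-1)
        k = max(1, int(self.keep_ratio * search_tokens.size(1)))
        idx = sim.topk(k, dim=1).indices
        idx = idx.unsqueeze(-1).expand(-1, -1, search_tokens.size(-1))
        return torch.gather(search_tokens, 1, idx)

    def forward(self, static_tmpl, dynamic_tmpls, search):
        # static_tmpl:   (B, Ns, D) tokens from the first-frame template
        # dynamic_tmpls: (B, Nd, D) tokens from templates refreshed over time
        # search:        (B, Nq, D) tokens from the current search region
        templates = torch.cat([static_tmpl, dynamic_tmpls], dim=1)
        search = self.prune(search, templates)
        # Joint self-attention over template and search tokens lets spatial
        # (multi-scale templates) and temporal (updated templates) context interact.
        tokens = torch.cat([templates, search], dim=1)
        return self.encoder(tokens)


if __name__ == "__main__":
    enc = ContextEncoder()
    out = enc(torch.randn(2, 64, 256), torch.randn(2, 128, 256), torch.randn(2, 256, 256))
    print(out.shape)  # torch.Size([2, 371, 256]): 192 template + 179 pruned search tokens
```

In this reading, the dynamic templates carry the temporal context (appearance changes across frames) while the multi-scale static and dynamic templates together carry the spatial context, and pruning low-relevance search tokens reduces the quadratic attention cost.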
