  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:05:30
10 May 2022

Recently, inspired by the success of the Transformer in natural language processing tasks, a number of works have attempted to apply Transformer-based models to video action recognition. Existing works use only a single RGB stream as input to the Transformer; how to combine multiple pathways and multiple streams with a Transformer for action recognition has not been studied. To address this issue, we present a novel structure, the Two-Pathway Vision Transformer (TP-ViT). Two parallel spatial Transformer encoders serve as two pathways that process the input video at different frame rates and resolutions. The high-resolution pathway captures more spatial information, while the high-frame-rate pathway captures more temporal information. The two outputs are fused and fed into a temporal Transformer encoder for action recognition. Furthermore, we also fuse skeleton features into our model to obtain better results. Our experiments demonstrate that our proposed models achieve state-of-the-art results on both the coarse-grained dataset Kinetics and the fine-grained dataset FineGym.
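The two-pathway structure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name, dimensions, fusion-by-concatenation along the time axis, mean pooling, and the number of classes are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn


def make_encoder(d_model: int, nhead: int, num_layers: int) -> nn.TransformerEncoder:
    """Build a small Transformer encoder (batch-first token sequences)."""
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)


class TwoPathwaySketch(nn.Module):
    """Hypothetical sketch of a two-pathway Transformer for action recognition.

    Two parallel spatial encoders process per-frame token embeddings from a
    high-resolution / low-frame-rate pathway and a low-resolution /
    high-frame-rate pathway; their outputs are fused and passed through a
    temporal encoder, then classified.
    """

    def __init__(self, d_model: int = 64, nhead: int = 4,
                 num_layers: int = 1, num_classes: int = 10):
        super().__init__()
        self.spatial_hires = make_encoder(d_model, nhead, num_layers)  # high-resolution pathway
        self.spatial_hifps = make_encoder(d_model, nhead, num_layers)  # high-frame-rate pathway
        self.temporal = make_encoder(d_model, nhead, num_layers)       # temporal encoder over fused tokens
        self.head = nn.Linear(d_model, num_classes)                    # action classifier (num_classes assumed)

    def forward(self, x_hires: torch.Tensor, x_hifps: torch.Tensor) -> torch.Tensor:
        # Each input: (batch, num_frames, d_model) per-frame token embeddings.
        a = self.spatial_hires(x_hires)
        b = self.spatial_hifps(x_hifps)
        fused = torch.cat([a, b], dim=1)        # fuse the two pathways along the time axis (assumed scheme)
        t = self.temporal(fused).mean(dim=1)    # pool over time for a clip-level representation
        return self.head(t)
```

Under this sketch, the high-frame-rate pathway would receive more (smaller) frame tokens than the high-resolution pathway, e.g. `model(torch.randn(2, 8, 64), torch.randn(2, 32, 64))` yields per-clip class logits of shape `(2, 10)`.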
