TP-VIT: A TWO-PATHWAY VISION TRANSFORMER FOR VIDEO ACTION RECOGNITION
Yanhao Jing, Feng Wang
-
SPS
IEEE Members: $11.00
Non-members: $15.00Length: 00:05:30
Recently, inspired by the success of Transformer in natural language processing tasks, a number of works have attempted to apply Transformer-based models to video action recognition. Existing works only use one RGB stream as the input for Transformer. How to use multiple pathways and multiple streams with Transformer for action recognition has not been studied. To address this issue, we present a novel structure namely Two-Pathway Vision Transformer (TP-ViT). Two parallel spatial Transformer encoders are used as two pathways with different framerates and resolutions of the input video. The high-resolution pathway contains more spatial information, while the high-framerate pathway contains more temporal information. The two outputs are fused and fed into a temporal Transformer encoder for action recognition. Furthermore, we also fuse skeleton features into our model to get better results. Our experiments demonstrate that our proposed models achieve the state-of-the-art results on both the coarse-grained dataset Kinetics and the fine-grained dataset FineGym.