GROUPED TEMPORAL ENHANCEMENT MODULE FOR HUMAN ACTION RECOGNITION
Hong Liu, Bin Ren, Mengyuan Liu, Runwei Ding
SPS
Length: 12:20
Temporal information is a significant cue for recognizing human actions in videos. A 2D CNN is efficient but captures only spatial information, whereas a 3D CNN captures both spatial and temporal information at a high computational cost. This paper presents a Grouped Temporal Enhancement (GTE) module that outperforms 3D CNNs while requiring a computational cost comparable to that of 2D CNNs. The GTE module first decomposes its input into spatial and temporal groups along the channel dimension, and then applies a learnable temporal shift (LTS) operation to the temporal group for efficient temporal modeling. Finally, a 2D convolution filter is used to enhance the spatial modeling ability of LTS. Extensive experiments on three benchmark datasets validate the effectiveness of the method.
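The channel grouping and temporal shift described above can be illustrated with a minimal sketch. The abstract does not give the exact LTS formulation, so this sketch assumes it can be approximated by a per-channel 3-tap temporal filter whose weights would be learned. The function name `grouped_temporal_shift`, the `taps` parameter, and the zero padding at the sequence boundaries are all hypothetical choices made for illustration. Spatial dimensions are collapsed to a single position per channel, and the final 2D convolution is omitted.

```python
from typing import List, Sequence

Frames = List[List[float]]  # shape [T][C]; spatial dimensions collapsed for brevity


def grouped_temporal_shift(
    x: Frames,
    n_temporal: int,
    taps: Sequence[Sequence[float]],
) -> Frames:
    """Apply a 3-tap temporal filter to the first n_temporal channels.

    taps[c] = (w_prev, w_cur, w_next) for temporal channel c. These weights
    stand in for learnable shift parameters (an assumption, not the paper's
    exact formulation). The remaining channels form the spatial group and
    pass through unchanged. Frames outside the sequence are treated as zeros.
    """
    T = len(x)
    out = [list(frame) for frame in x]  # copy, so spatial channels stay as they are
    for c in range(n_temporal):
        w_prev, w_cur, w_next = taps[c]
        for t in range(T):
            prev_v = x[t - 1][c] if t > 0 else 0.0
            next_v = x[t + 1][c] if t < T - 1 else 0.0
            out[t][c] = w_prev * prev_v + w_cur * x[t][c] + w_next * next_v
    return out


# 3 frames, 3 channels: channels 0-1 are the temporal group, channel 2 is spatial.
x = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]]
# (1, 0, 0) copies the previous frame (shift forward in time);
# (0, 0, 1) copies the next frame (shift backward in time).
taps = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
y = grouped_temporal_shift(x, n_temporal=2, taps=taps)
print(y)  # [[0.0, 20.0, 100.0], [1.0, 30.0, 200.0], [2.0, 0.0, 300.0]]
```

With the tap weights fixed to one-hot values as in this example, the filter reduces to a plain temporal shift. Letting the weights vary is one plausible way to make the shift learnable, at a cost of only three multiplications per temporal channel per frame.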