EFFICIENT TEMPORAL-SPATIAL FEATURE GROUPING FOR VIDEO ACTION RECOGNITION
Zhikang Qiu, Xu Zhao, Zhilan Hu
Temporal information plays an important role in action recognition. Recently, 3D CNNs have been widely used to extract temporal features from videos. However, compared with 2D CNNs, 3D CNNs have more parameters and impose a heavy computational burden, so improving the efficiency of action recognition is necessary. In this paper, inspired by group convolution and convolution kernel decomposition, we propose a novel module called the grouped decomposed module (GDM), which separates channels into three groups and applies 3D, 2D, and 1D convolutions to them in parallel. This module extracts spatial and temporal features efficiently. Based on GDM, we design a new network named the grouped decomposed network (GDN). GDN achieves state-of-the-art performance on two temporal-related datasets (Something-Something V1 & V2) while requiring few parameters and FLOPs.
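As a rough illustration of the grouped parallel design described above, the following is a minimal PyTorch-style sketch. The equal three-way channel split, the kernel sizes, and the realization of the 2D and 1D branches as (1,3,3) and (3,1,1) 3D kernels are assumptions for illustration only, not the authors' exact GDM design.

```python
# Minimal sketch of the grouped decomposed module (GDM) idea: split channels
# into three groups and process them in parallel with 3D, 2D (spatial), and
# 1D (temporal) convolutions. Group ratio and kernel choices are assumptions.
import torch
import torch.nn as nn


class GroupedDecomposedModule(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Assumed equal three-way channel split.
        g = channels // 3
        self.splits = (g, g, channels - 2 * g)
        c3, c2, c1 = self.splits
        # Branch 1: full 3D convolution (temporal + spatial).
        self.conv3d = nn.Conv3d(c3, c3, kernel_size=3, padding=1)
        # Branch 2: 2D spatial convolution, realized as a (1,3,3) 3D kernel.
        self.conv2d = nn.Conv3d(c2, c2, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Branch 3: 1D temporal convolution, realized as a (3,1,1) 3D kernel.
        self.conv1d = nn.Conv3d(c1, c1, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time, height, width).
        x3, x2, x1 = torch.split(x, self.splits, dim=1)
        return torch.cat([self.conv3d(x3), self.conv2d(x2), self.conv1d(x1)], dim=1)


if __name__ == "__main__":
    clip = torch.randn(2, 96, 8, 56, 56)  # (N, C, T, H, W)
    out = GroupedDecomposedModule(96)(clip)
    print(out.shape)  # torch.Size([2, 96, 8, 56, 56])
```

Because only one of the three groups uses a full 3D kernel, such a split reduces parameters and FLOPs relative to applying 3D convolution over all channels, which matches the efficiency motivation stated in the abstract.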