FREQUENCY ENHANCEMENT NETWORK FOR EFFICIENT COMPRESSED VIDEO ACTION RECOGNITION
Yue Ming, Lu Xiong, Xia Jia, Qingfang Zheng, Jiangwan Zhou, Fan Feng, Nannan Hu
SPS
Existing frequency-based action recognition methods achieve impressive efficiency gains. However, they ignore low-frequency texture and edge cues, which degrades accuracy. To address this problem, we propose a novel frequency enhancement (FE) block for efficient compressed video action recognition, consisting of a temporal-channel two-heads attention (TCTHA) module and a frequency overlapping group convolution (FOGC) module. First, the TCTHA module uses attention to emphasize the inter-frame temporal context and the informative frequency semantics within each frame. Then, the FOGC module groups channels from different frequency bands with overlap, extracting low-frequency texture and edge cues while maintaining interaction between groups. We integrate the FE block into 2D-CNNs that take frequency-domain I-frames as input, yielding FENet, which focuses on the pivotal low-frequency spatio-temporal semantics for action recognition. Experiments on HMDB-51, UCF-101, Kinetics-400, and Kinetics-700 verify that FENet achieves accuracy comparable to RGB-based methods while remaining highly efficient.
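The overlapping channel grouping described for the FOGC module can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `overlapping_groups`, its parameters, and the assumption that channels are ordered from low to high frequency are all hypothetical choices made for illustration. It only shows how contiguous channel groups can share a fixed number of channels with their neighbors, so that adjacent frequency bands keep some interaction.

```python
def overlapping_groups(num_channels, num_groups, overlap):
    """Split channel indices into contiguous groups sharing `overlap`
    channels with each neighboring group.

    Hypothetical helper: assumes channels are ordered by frequency band
    and that the sizes divide evenly.
    """
    if num_groups < 1 or overlap < 0:
        raise ValueError("num_groups must be >= 1 and overlap >= 0")
    # Each of the (num_groups - 1) boundaries counts `overlap` channels twice.
    total = num_channels + (num_groups - 1) * overlap
    if total % num_groups != 0:
        raise ValueError("channels cannot be split evenly with this overlap")
    size = total // num_groups
    if overlap >= size:
        raise ValueError("overlap must be smaller than the group size")
    stride = size - overlap  # offset between the starts of adjacent groups
    return [list(range(g * stride, g * stride + size))
            for g in range(num_groups)]


if __name__ == "__main__":
    for group in overlapping_groups(8, 3, 2):
        print(group)
```

In a real network, each index list would select the input channels for one convolution branch; the shared channels are what let neighboring groups exchange information.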