Dual-Feature Enhancement for Weakly Supervised Temporal Action Localization
Siying Liu (University of Science and Technology of China); Qiankun Liu (Beijing Institute of Technology); Qi Chu (University of Science and Technology of China); Bin Liu (University of Science and Technology of China); Nenghai Yu (University of Science and Technology of China)
Weakly-supervised Temporal Action Localization (WTAL) aims at localizing actions in untrimmed videos with only video-level labels. Most existing methods embrace a "localization by classification" paradigm and adopt a model pre-trained on a recognition task for feature extraction. The gap between the recognition and localization tasks leads to inferior performance. Some recent works attempt to obtain better features for localization via feature enhancement and boost performance to some extent. However, they are limited to exploiting intra-video information and ignore meaningful inter-video information in the dataset. In this paper, we propose a novel Dual-Feature Enhancement (DFE) method for WTAL that utilizes both intra- and inter-video information. For intra-video information, a local feature enhancement module is designed to promote feature interaction along the temporal dimension within each video. For inter-video information, a global memory module is first designed to learn representations for different categories across different videos. Then, a global feature enhancement module enhances the video features with the help of those global representations in the memory. Besides, to avoid the extra computational cost that global enhancement would incur at inference time, a distillation loss is applied to enforce the local-branch features to learn the information from the global branch, so that global enhancement can be removed during inference. The proposed method achieves state-of-the-art performance on popular benchmarks.
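To make the dual-branch design and the distillation trick concrete, the following is a minimal PyTorch sketch of the idea as described in the abstract. The specific layer choices (a temporal 1-D convolution for the local branch, cross-attention to learnable class-level memory slots for the global branch) and the MSE form of the distillation loss are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualFeatureEnhancement(nn.Module):
    """Illustrative sketch of the DFE idea (layer choices are assumptions)."""

    def __init__(self, feat_dim: int = 2048, num_classes: int = 20):
        super().__init__()
        # Local branch: temporal 1-D convolution promotes feature
        # interaction along the time axis within a single video.
        self.local_enhance = nn.Conv1d(feat_dim, feat_dim,
                                       kernel_size=3, padding=1)
        # Global memory: one learnable representation per action
        # category, shared across all videos in the dataset.
        self.global_memory = nn.Parameter(torch.randn(num_classes, feat_dim))
        # Global branch: cross-attention from snippet features to the
        # class-level memory slots.
        self.global_enhance = nn.MultiheadAttention(feat_dim, num_heads=8,
                                                    batch_first=True)

    def forward(self, x):
        # x: (batch, T, feat_dim) snippet-level features from a
        # pre-trained recognition backbone.
        local = self.local_enhance(x.transpose(1, 2)).transpose(1, 2)
        mem = self.global_memory.unsqueeze(0).expand(x.size(0), -1, -1)
        glob, _ = self.global_enhance(query=local, key=mem, value=mem)
        # Distillation: push the cheap local branch toward the global
        # branch so the global path can be dropped at inference.
        distill_loss = F.mse_loss(local, glob.detach())
        return local, glob, distill_loss
```

In this reading, training would use both branch outputs plus the distillation term, while inference would keep only `local`, which is what lets the memory lookup and attention be removed without changing the deployed model's interface.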