Hierarchical Spatiotemporal Feature Fusion Network for Video Saliency Prediction
Yunzuo Zhang (Shijiazhuang Tiedao University); Tian Zhang (Shijiazhuang Tiedao University); Cunyu Wu (Shijiazhuang Tiedao University); Yuxin Zheng (Shijiazhuang Tiedao University)
Current video saliency prediction methods have made great progress by relying on the feature extraction capability of CNNs, but hierarchical feature fusion remains deficient, limiting further gains in accuracy. To address this issue, we propose a 3D convolutional Hierarchical Spatiotemporal Feature Fusion Network (HSFF-Net). Specifically, we propose a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP), the first application of a bi-directional fusion architecture in this field, which adds a flow of shallow location information to the flow of deep semantic information. Then, unlike simple addition or concatenation, we design a Hierarchical Adaptive Fusion (HAF) mechanism that adaptively learns the fusion weights of adjacent features. Moreover, a Frame-wise Attention (FA) module is introduced to augment the temporal features to be fused. Our model is simple yet effective and runs in real time. Experimental results on three video saliency benchmarks demonstrate that HSFF-Net outperforms existing state-of-the-art methods in accuracy.
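To make the two fusion ideas concrete, the sketch below illustrates one plausible reading of the abstract: a HAF-style fusion of two adjacent pyramid features with learned, normalized weights (in the spirit of BiFPN's weighted fusion, rather than plain addition or concatenation), and an FA-style module that reweights features along the temporal axis in a squeeze-and-excitation fashion. This is a minimal illustration under those assumptions, not the authors' implementation; all module names, shapes, and design details here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAdaptiveFusion(nn.Module):
    """Fuse two adjacent 3D pyramid features with learned fusion weights
    (hypothetical sketch; the actual HAF design is defined in the paper)."""
    def __init__(self, channels):
        super().__init__()
        # One learnable scalar per input branch, normalized at fusion time.
        self.weights = nn.Parameter(torch.ones(2))
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, shallow, deep):
        # Bring the deeper (coarser) feature to the shallow feature's size.
        deep = F.interpolate(deep, size=shallow.shape[2:], mode="trilinear",
                             align_corners=False)
        w = F.softmax(self.weights, dim=0)   # adaptive fusion weights
        return self.conv(w[0] * shallow + w[1] * deep)

class FrameWiseAttention(nn.Module):
    """Reweight frames along the temporal axis, squeeze-and-excitation style
    (an assumption about how the FA module augments temporal features)."""
    def __init__(self, frames):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(frames, frames // 2),
            nn.ReLU(inplace=True),
            nn.Linear(frames // 2, frames),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        s = x.mean(dim=(1, 3, 4))             # per-frame global statistics
        a = self.fc(s)                        # per-frame attention scores
        return x * a[:, None, :, None, None]  # broadcast over C, H, W

# Toy usage with made-up shapes: fuse an attended shallow feature with a
# coarser deep feature from the adjacent pyramid level.
shallow = torch.randn(1, 64, 8, 28, 28)
deep = torch.randn(1, 64, 8, 14, 14)
fused = HierarchicalAdaptiveFusion(64)(FrameWiseAttention(8)(shallow), deep)
print(fused.shape)  # torch.Size([1, 64, 8, 28, 28])
```

Normalizing the branch weights with a softmax keeps the fusion a convex combination of the inputs, which is one common way to let the network learn, per level, how much shallow localization versus deep semantics to keep; whether HSFF-Net uses this exact scheme is an assumption here.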