Spatial-Temporal Feature Aggregation Network For Video Object Detection
Zhu Chen, Weihai Li, Chi Fei, NengHai Yu, Bin Liu
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 12:38
Video object detection is a challenging problem in computer vision. In this paper, we propose a novel spatial-temporal feature aggregation network to address it. Specifically, we present an instance-level feature aggregation module that complements traditional pixel-level feature aggregation, in which we build a new movement estimation module to learn instance movements across frames. Graph Convolutional Networks (GCNs) are then applied to capture temporal relations among instances across frames and implement instance-level feature aggregation. Finally, we combine pixel-level and instance-level features with learnable soft weights to exploit their complementary information. Our framework is simple to implement and supports end-to-end training, and extensive experiments show that it achieves state-of-the-art performance on the ImageNet VID dataset.
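The two aggregation ideas in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes a single, simplified graph-convolution step with no learned weight matrix or nonlinearity, a hand-written adjacency matrix standing in for the relations the network would learn, and a softmax over two scalar logits as one plausible reading of "learnable soft weights". The function names `gcn_step` and `fuse` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gcn_step(features, adjacency):
    """One simplified graph-convolution step over instance features.

    Each instance feature is replaced by the weighted mean of itself and
    its neighbours, i.e. row-normalised (A + I) times the feature matrix.
    Assumption: a real GCN layer would also apply a learned weight matrix
    and a nonlinearity, both omitted here.
    """
    n = len(features)
    dim = len(features[0])
    aggregated = []
    for i in range(n):
        # Add a self-loop so an instance keeps part of its own feature.
        row = [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        norm = sum(row)
        aggregated.append(
            [sum(row[j] * features[j][d] for j in range(n)) / norm
             for d in range(dim)]
        )
    return aggregated


def fuse(pixel_feat, instance_feat, logits):
    """Blend two feature vectors with soft weights that sum to 1.

    Assumption: `logits` holds two scalars that training would update;
    here they are fixed by hand.
    """
    w_pixel, w_instance = softmax(logits)
    return [w_pixel * p + w_instance * q
            for p, q in zip(pixel_feat, instance_feat)]


# Three instances (e.g. the same object seen in three frames), 2-D features.
instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical relation graph: instance 1 is linked to instances 0 and 2.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]

instance_level = gcn_step(instances, adjacency)
fused = fuse([1.0, 0.0], instance_level[0], [0.0, 0.0])
print(instance_level[0], fused)
```

With equal logits the two branches contribute equally. Making one logit larger shifts the blend toward that branch, which is how such a weighting could let training decide how much to rely on pixel-level versus instance-level information.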