Is the U-NET Directional-Relationship Aware?
Mateus Riva, Pietro Gori, Florian Yger, Isabelle Bloch
SPS
IEEE Members: $11.00
Non-members: $15.00
Length: 00:15:28
Pedestrian detection is a key task in intelligent video surveillance systems, and it requires both fast inference and high detection accuracy. Single-stage deep learning pedestrian detectors achieve relatively high accuracy with a simpler architecture and shorter inference time, but their performance still lags behind that of two-stage methods. The reason is that, without proposal regions to guide them, single-stage detectors lack scale-aware features. To overcome this, a module based on a multi-scale deformable transformer encoder is proposed. It extracts a sparse set of important features at deformable sampling locations across multiple feature levels. The proposed architecture significantly improves on the baseline center and scale prediction method on both the Caltech and CityPersons datasets. It even outperforms state-of-the-art two-stage methods at detecting heavily occluded pedestrians on the CityPersons validation set.