D-Conformer: Deformable Sparse Transformer Augmented Convolution for Voxel-based 3D Object Detection
Xiao Zhao (Fudan University); Liuzhen Su (Fudan University); Xukun Zhang (Fudan University); Dingkang Yang (Fudan University); Mingyang Sun (Fudan University); Shunli Wang (Fudan University); Peng Zhai (Fudan university); Lihua Zhang (Fudan University)
SPS
Although CNN-based and Transformer-based detectors have made impressive progress in 3D object detection, these two network paradigms suffer from insufficient receptive field and weakened local detail, respectively, which significantly limits the feature extraction capability of the backbone. In this paper, we propose to fuse convolution and transformer while accounting for the fact that non-empty voxels at different positions in 3D space contribute differently to object detection, which makes directly applying standard convolution and transformer to voxels unsuitable. Specifically, we design a novel deformable sparse transformer, termed D-Conformer, that performs long-range information interaction over the fine-grained local detail semantics aggregated by focal sparse convolution. D-Conformer learns valuable voxels position-wise in sparse space and can serve as the backbone of most voxel-based detectors. Extensive experiments demonstrate that our method achieves strong detection results and outperforms state-of-the-art 3D detection methods by a large margin.