Weichen Xu (Peking University); Tianhao Fu (Peking University)
07 Jun 2023
Monocular 3D object detection is a challenging problem in the self-driving and computer vision communities. Previous works suffer from a severe seesaw phenomenon: multi-category learning performs worse than single-category learning, because feature learning for one category inhibits that of the others. We reveal that the real culprit is the significant difference in depth distribution between categories, and that confused feature representations make depth estimation even harder. In this paper, we propose Language Knowledge Transferring, which introduces language information into monocular 3D object detection, termed MonoLT. Multimodal language-image guidance helps the network learn more class-specific features, which reduces the pressure on depth estimation. Meanwhile, we propose the Polar Depth Aggregator to make depth estimation less disturbed by the environment and by other instances (especially those of different classes). Comprehensive experiments on the KITTI dataset demonstrate the superiority of the proposed method.