Deep learning has significantly improved the accuracy of object detection when abundant labeled data are available. However, collecting and annotating sufficient data is extremely laborious. Zero-shot object detection (ZSD) has been proposed to address this problem; it aims to simultaneously recognize and localize both seen and unseen objects. Recently, the transformer and its variant architectures have shown their effectiveness over conventional methods in many natural language processing and computer vision tasks. In this paper, we study the ZSD task and develop a new framework named zero-shot object detection with transformers (ZSDTR). ZSDTR consists of a head network, a transformer encoder, a transformer decoder, and a vision-semantic-attention tail network. We find that the transformer is very effective at improving the recall of unseen objects, while the tail network is used to discriminate between seen and unseen objects. To the best of our knowledge, ZSDTR is the first method to apply transformers to the ZSD task. Extensive experimental results on several zero-shot object detection benchmarks show that ZSDTR outperforms current state-of-the-art methods.
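To make the described pipeline concrete, the following is a minimal sketch of a ZSDTR-style architecture, assuming a DETR-like layout: a head network producing image features, a transformer encoder/decoder over flattened feature tokens with learned object queries, and a tail network that projects each decoded object into a semantic (word-embedding) space so both seen and unseen classes can be scored against their class embeddings. All module names, dimensions, and the exact tail-network design are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical ZSDTR-style sketch (PyTorch); not the paper's official code.
import torch
import torch.nn as nn


class ZSDTRSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6,
                 num_queries=100, semantic_dim=300):
        super().__init__()
        # Head network: a small CNN stand-in that maps images to feature maps.
        self.head = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
        )
        # Transformer encoder/decoder over flattened feature tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.queries = nn.Embedding(num_queries, d_model)
        # Tail network (assumed form): project decoded objects into the
        # semantic space for seen/unseen classification; a box head predicts
        # normalized bounding boxes.
        self.to_semantic = nn.Linear(d_model, semantic_dim)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, images, class_embeddings):
        # images: (B, 3, H, W); class_embeddings: (num_classes, semantic_dim)
        feats = self.head(images)                              # (B, C, h, w)
        B, C, h, w = feats.shape
        tokens = feats.flatten(2).permute(2, 0, 1)             # (h*w, B, C)
        memory = self.encoder(tokens)
        tgt = self.queries.weight.unsqueeze(1).repeat(1, B, 1) # (Q, B, C)
        decoded = self.decoder(tgt, memory)                    # (Q, B, C)
        # Score each query against class embeddings (seen and unseen alike).
        sem = self.to_semantic(decoded)                        # (Q, B, semantic_dim)
        logits = torch.einsum("qbd,kd->qbk", sem, class_embeddings)
        boxes = self.box_head(decoded).sigmoid()               # normalized boxes
        return logits, boxes


# Usage: unseen classes need only their semantic embeddings at test time.
model = ZSDTRSketch()
images = torch.randn(2, 3, 256, 256)
class_embeddings = torch.randn(20, 300)  # e.g. word vectors for 20 classes
logits, boxes = model(images, class_embeddings)
```

Because classification is done by matching query features against class embeddings rather than a fixed classifier, adding unseen categories at inference time only requires supplying their semantic vectors.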