End-to-End Non-Autoregressive Image Captioning

Hong Yu (Dalian University of Technology); Yuanqiu Liu (Dalian University of Technology); BaoKun Qi (Dalian University of Technology); Zhaolong Hu (Dalian University of Technology); Han Liu (Dalian University of Technology)

08 Jun 2023

Most existing image captioning models generate captions autoregressively, one word at a time, which leads to high inference latency. Non-autoregressive decoding generates all words in parallel, which greatly improves inference speed. However, it usually degrades caption quality, because the decoder no longer receives previously generated words as input. In this paper, we propose a semantic retrieval module that uses image features to retrieve semantic information as input to the non-autoregressive decoder, narrowing the performance gap between non-autoregressive and autoregressive models. Furthermore, we adopt Swin-Transformer instead of Faster R-CNN to extract image features, thus building an end-to-end image captioning model. Experiments conducted on the MSCOCO dataset show that our model achieves a new state-of-the-art CIDEr score of 122.6% on the 'Karpathy' offline test split with a 37× inference speedup.
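The difference between the two decoding styles described in the abstract can be illustrated with a toy sketch. This is not the authors' model. The vocabulary, the random embeddings, and the functions `retrieve_semantics`, `autoregressive_decode`, and `non_autoregressive_decode` are all hypothetical stand-ins, and the retrieval step (top-k words by dot-product similarity to the image feature) is only an assumed reading of "uses image features to retrieve semantic information".

```python
import random

# Hypothetical toy setup: random vectors stand in for learned weights.
VOCAB = ["a", "dog", "cat", "runs", "sits", "on", "grass", "sofa"]
DIM = 4
MAX_LEN = 5

rng = random.Random(0)
WORD_EMB = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in VOCAB}
POS_EMB = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(MAX_LEN)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def best_word(query):
    # Pick the vocabulary word whose embedding best matches the query vector.
    return max(VOCAB, key=lambda w: dot(query, WORD_EMB[w]))


def retrieve_semantics(image_feat, k=3):
    # Assumed retrieval rule: top-k words most similar to the image feature.
    ranked = sorted(VOCAB, key=lambda w: dot(image_feat, WORD_EMB[w]), reverse=True)
    return ranked[:k]


def autoregressive_decode(image_feat):
    # Each step consumes the previously generated word, so steps must run in order.
    words, prev = [], [0.0] * DIM
    for pos in range(MAX_LEN):
        query = add(add(image_feat, POS_EMB[pos]), prev)
        word = best_word(query)
        words.append(word)
        prev = WORD_EMB[word]
    return words


def non_autoregressive_decode(image_feat):
    # No position reads another position's output, so all positions could be
    # computed in parallel; retrieved words replace the missing word input.
    retrieved = retrieve_semantics(image_feat)
    context = [sum(WORD_EMB[w][d] for w in retrieved) / len(retrieved) for d in range(DIM)]
    return [best_word(add(add(image_feat, POS_EMB[pos]), context)) for pos in range(MAX_LEN)]


image_feat = [rng.uniform(-1, 1) for _ in range(DIM)]
print(autoregressive_decode(image_feat))
print(non_autoregressive_decode(image_feat))
```

In this sketch the autoregressive loop has a data dependency from one step to the next, while the non-autoregressive version is a plain map over positions. That independence is the property that allows parallel generation, and the retrieved context is what compensates for the absent word input.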

Tags:

Machine learning for image processing

Value-Added Bundle(s) Including this Product

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
