IID-NORD: A Comprehensive Intrinsic Image Decomposition Dataset
Diclehan Ulucan, Oguzhan Ulucan, Marc Ebner
SPS
Length: 00:09:02
Vision-and-language models transfer easily to other tasks. In particular, they have been shown to work well for evaluating automatic image captioning, making it possible to evaluate systems without references or any information beyond the image and the caption. However, these models do not provide a straightforward way to evaluate videos. In this paper, we propose using them for video captioning evaluation. We explore both single-image evaluation and several methods for aggregating information from multiple frames. Experiments show that using clustering to select a few representative frames and computing the final score over them yields an excellent correlation with human judgment. Since bias in the human annotations can also influence the metric, we propose filtering the human assessments to discard outliers and improve the evaluation process.
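The abstract sketches two steps: selecting a few representative frames by clustering and discarding outlier human ratings. A minimal sketch of both ideas is given below; it is not the paper's implementation. The feature extractor and the per-frame image-caption scorer (e.g. a CLIP-style model) are abstracted away as inputs, and the tiny k-means loop, the frame-selection rule (frame nearest each centroid), and the z-score outlier filter are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): pick representative frames by
# clustering per-frame features, average a per-frame caption score over them,
# and filter outlier human ratings with a simple z-score rule.
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Plain Lloyd's k-means on an (n_frames, dim) feature matrix."""
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # Distance of every frame to every centroid, shape (n, k).
        d = np.linalg.norm(feats[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = feats[labels == j].mean(axis=0)
    return centroids

def select_frames(feats, k):
    """Return indices of the frames closest to each cluster centroid."""
    centroids = kmeans(feats, k)
    d = np.linalg.norm(feats[:, None] - centroids[None], axis=2)
    return sorted({int(d[:, j].argmin()) for j in range(d.shape[1])})

def video_caption_score(frame_scores, feats, k=3):
    """Average a per-frame image-caption score over k selected frames."""
    idx = select_frames(np.asarray(feats, dtype=float), k)
    return float(np.mean([frame_scores[i] for i in idx]))

def filter_outliers(ratings, z=1.5):
    """Drop human ratings more than z standard deviations from the mean."""
    r = np.asarray(ratings, dtype=float)
    mu, sd = r.mean(), r.std()
    if sd == 0:
        return r
    return r[np.abs(r - mu) <= z * sd]
```

In this sketch the final video score is the mean over the selected frames; any other aggregation (max, weighted mean) would slot into `video_caption_score` the same way.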