Adopting Self-Supervised Learning into Unsupervised Video Summarization through Restorative Score
Mehryar Abbasi Boroujeni, Parvaneh Saeedi
In this paper, we present a new method for unsupervised video summarization. We train a transformer encoder in a self-supervised manner to reconstruct the missing frames of a partially masked video. We then introduce an algorithm that uses the trained encoder to generate an importance score for each frame, and these scores are used to create the video summary. We show that the model's reconstruction loss for a video with masked frames correlates with the representativeness of the remaining frames. We validate our approach on two benchmark datasets, TVSum and SumMe, and demonstrate that it outperforms state-of-the-art (SOTA) methods. Additionally, our approach is more stable during training than SOTA techniques based on generative adversarial learning. Our source code is publicly available.
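The idea that reconstruction loss reflects how representative the unmasked frames are can be illustrated with a minimal sketch. This is not the authors' algorithm or released code. It assumes a trivial stand-in reconstructor (each masked frame is filled with the mean of the visible frames) in place of the trained transformer encoder, and it assumes one plausible scoring rule: a frame scores higher when the random masks that keep it visible yield a lower average reconstruction loss. All function names and parameters (`reconstruct`, `reconstruction_loss`, `restorative_scores`, `mask_ratio`, `trials`) are hypothetical.

```python
import random


def reconstruct(features, visible):
    """Stand-in for a trained encoder (assumption): fill every masked
    frame with the mean feature vector of the visible frames."""
    dim = len(features[0])
    mean = [sum(features[i][d] for i in visible) / len(visible) for d in range(dim)]
    return [features[i] if i in visible else mean for i in range(len(features))]


def reconstruction_loss(features, visible):
    """Mean squared error over the masked frames only."""
    recon = reconstruct(features, visible)
    masked = [i for i in range(len(features)) if i not in visible]
    if not masked:
        return 0.0
    total = sum(
        sum((a - b) ** 2 for a, b in zip(features[i], recon[i])) for i in masked
    )
    return total / len(masked)


def restorative_scores(features, mask_ratio=0.5, trials=200, seed=0):
    """Hypothetical scoring rule: for each frame, average the loss over
    random masks in which that frame stays visible, then negate it so
    that a lower loss gives a higher importance score."""
    rng = random.Random(seed)
    n = len(features)
    keep = max(1, round(n * (1 - mask_ratio)))  # number of visible frames
    loss_sum = [0.0] * n
    count = [0] * n
    for _ in range(trials):
        visible = set(rng.sample(range(n), keep))
        loss = reconstruction_loss(features, visible)
        for i in visible:
            loss_sum[i] += loss
            count[i] += 1
    return [
        -(loss_sum[i] / count[i]) if count[i] else float("-inf") for i in range(n)
    ]


if __name__ == "__main__":
    # Toy one-dimensional "frame features": two similar frames and one outlier.
    toy = [[0.0], [0.0], [10.0]]
    print(restorative_scores(toy, mask_ratio=0.67))
```

In this toy setting, keeping either of the two similar frames lets the stand-in reconstructor recover more of the video than keeping the outlier does, so those frames receive the higher scores. In practice, the mean-filling function would be replaced by the trained encoder, and the summary would be built by selecting frames or shots according to their scores.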