Decaying Contrast for Fine-grained Video Representation Learning
Heng Zhang (Gaoling School of Artificial Intelligence, Renmin University of China); Bing Su (Renmin University of China)
SPS
Prior contrast-based methods for video representation learning mainly focus on clip discrimination while ignoring the temporal context and relationships among clips from the same video. As a consequence, the learned spatiotemporal representations of successive clips are inconsistent and hence perform poorly in fine-grained downstream tasks such as video fragment retrieval or localization. In this paper, we propose a decaying strategy to capture the gradual evolution along the temporal dimension for fine-grained spatiotemporal representation learning, which consists of two novel contrastive losses. The external decaying contrastive loss is designed to increase the relative similarity of clips from the same video, while the internal decaying contrastive loss aims to maintain the discrimination between clips. Experimental results show that the proposed decaying contrastive training approach achieves a significant improvement in the fine-grained video retrieval task on multiple benchmark datasets.
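The abstract gives no formulas, so the following is only a minimal sketch of one plausible way a temporally decaying contrastive weighting could look: an InfoNCE-style loss in which clips from the same video at temporal distance d act as soft positives with weight decay**d. The function name, the exponential decay form, and all hyperparameter values are assumptions for illustration, not the paper's actual losses.

```python
import numpy as np

def decaying_contrastive_loss(clip_embs, temperature=0.1, decay=0.5):
    """Hypothetical sketch of a decaying contrastive loss.

    clip_embs: (n_clips, dim) array of embeddings for consecutive clips
    of ONE video, ordered in time. Each clip treats every other clip of
    the same video as a soft positive, weighted by decay**|i - j| so
    that temporally distant clips contribute less (the 'decaying' idea;
    the exact form used in the paper is not specified in the abstract).
    """
    # L2-normalize so the dot product is cosine similarity
    z = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    sim = z @ z.T / temperature          # temperature-scaled similarities
    n = len(z)
    loss = 0.0
    for i in range(n):
        logits = np.delete(sim[i], i)    # exclude self-similarity
        log_den = np.log(np.exp(logits).sum())
        # decay weights over temporal distance, normalized to sum to 1
        w = np.array([decay ** abs(i - j) for j in range(n) if j != i])
        w = w / w.sum()
        # weighted negative log-softmax over the other clips
        loss += -(w * (logits - log_den)).sum()
    return loss / n
```

Because each log-softmax term is non-positive and the weights sum to one, the loss is non-negative; setting decay close to 0 recovers a loss dominated by the immediately adjacent clips, while decay close to 1 treats all clips of the video as equally positive.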