Action Relational Graph For Weakly-Supervised Temporal Action Localization
Yi Cheng, Ying Sun, Dongyun Lin, Joo-Hwee Lim
The task of weakly-supervised temporal action localization (WTAL) is to recognize numerous unstructured actions in untrimmed videos with only video-level class labels. As various actions may occur in an untrimmed video, it is desirable to capture the correlations among different actions to effectively identify the target actions. In this paper, we propose a novel Action Relational Graph Network (ARG-Net) to model the correlation between action labels. Specifically, we build a co-occurrence graph in which the nodes are word embeddings of the action labels and the edges encode the relations between pairs of labels. A Graph Convolutional Network (GCN) is then applied to project the label embeddings into a set of correlated action classifiers, which are multiplied with the learned video representations for video-level classification. To facilitate discriminative video representation learning, we employ an attention mechanism to model the probability that a frame contains action instances. A new Action Normalization Loss (ANL) is further proposed to alleviate the confusion caused by irrelevant background frames (i.e., frames containing no actions). Experimental results on the THUMOS14 and ActivityNet1.2 datasets demonstrate that our ARG-Net outperforms state-of-the-art methods.
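To make the pipeline in the abstract concrete, below is a minimal sketch, not the authors' implementation, of the two components it describes: a GCN over a label co-occurrence graph that turns word embeddings of action labels into per-class classifiers, and a frame-level attention head whose weighted pooling yields a video-level representation that is multiplied with those classifiers. All module names, layer sizes, and the random inputs are illustrative assumptions; the Action Normalization Loss and the actual feature extractor are omitted.

```python
# Minimal sketch (assumed design, not the official ARG-Net code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution step: propagate transformed node features over the graph."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # x: (C, in_dim) label-node features; adj: (C, C) row-normalized co-occurrence matrix.
        return adj @ self.weight(x)


class ARGHead(nn.Module):
    """Illustrative head: label-graph classifiers combined with frame attention."""

    def __init__(self, feat_dim=2048, embed_dim=300):
        super().__init__()
        self.gcn1 = GCNLayer(embed_dim, 1024)
        self.gcn2 = GCNLayer(1024, feat_dim)
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, frame_feats, label_embed, adj):
        # frame_feats: (T, feat_dim) per-frame features of one untrimmed video.
        # label_embed: (C, embed_dim) word embeddings of the C action labels.
        classifiers = self.gcn2(F.relu(self.gcn1(label_embed, adj)), adj)  # (C, feat_dim)
        attn = torch.softmax(self.attention(frame_feats), dim=0)          # (T, 1) frame scores
        video_repr = (attn * frame_feats).sum(dim=0)                      # (feat_dim,) pooled video
        logits = classifiers @ video_repr                                 # (C,) video-level logits
        return logits, attn


# Shape check with random tensors (hypothetical dimensions).
T, C, D, E = 100, 20, 2048, 300
adj = torch.softmax(torch.rand(C, C), dim=1)  # stand-in for the label co-occurrence graph
head = ARGHead(feat_dim=D, embed_dim=E)
logits, attn = head(torch.randn(T, D), torch.randn(C, E), adj)
print(logits.shape, attn.shape)  # torch.Size([20]) torch.Size([100, 1])
```

Under this reading, the video-level logits would be trained against the video-level labels, while the frame attention scores provide the temporal localization signal at inference time; the exact graph construction, normalization, and loss terms are as specified in the paper rather than in this sketch.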