22 Sep 2021

The task of weakly-supervised temporal action localization (WTAL) is to localize and recognize actions in untrimmed videos using only video-level class labels. Since multiple actions may occur in a single untrimmed video, it is desirable to capture the correlations among different actions in order to identify the target actions effectively. In this paper, we propose a novel Action Relational Graph Network (ARG-Net) to model the correlations between action labels. Specifically, we build a co-occurrence graph over the action labels, where the nodes are represented by word embeddings of the labels and the edges encode the relations between label pairs. A Graph Convolutional Network (GCN) then projects the action label embeddings into a set of correlated action classifiers, which are multiplied with the learned video representations for video-level classification. To facilitate discriminative video representation learning, we employ an attention mechanism to model the probability that a frame contains action instances. A new Action Normalization Loss (ANL) is proposed to further alleviate the confusion caused by irrelevant background frames (i.e., frames containing no actions). Experimental results on the THUMOS14 and ActivityNet 1.2 datasets demonstrate that our ARG-Net outperforms state-of-the-art methods.
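To make the pipeline described in the abstract concrete, the following PyTorch sketch illustrates the general idea: a GCN over label word embeddings produces correlated action classifiers, an attention module pools frame features into a video representation, and the two are multiplied for video-level classification. All layer sizes, the two-layer GCN, and the toy inputs are assumptions for illustration only; this is not the authors' exact architecture, and the Action Normalization Loss is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionRelationalGraphSketch(nn.Module):
    """Minimal sketch of the ARG-Net idea (assumed dimensions, not the paper's exact model)."""

    def __init__(self, num_classes, word_dim=300, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Two GCN layers mapping label word embeddings to classifier weights (assumed depth).
        self.gcn1 = nn.Linear(word_dim, hidden_dim)
        self.gcn2 = nn.Linear(hidden_dim, feat_dim)
        # Frame-level attention scoring how likely a frame is to contain an action.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, frame_feats, label_embeddings, adjacency):
        # frame_feats:      (batch, num_frames, feat_dim) snippet features
        # label_embeddings: (num_classes, word_dim) word vectors of the action labels
        # adjacency:        (num_classes, num_classes) normalized label co-occurrence graph

        # GCN: propagate label information over the co-occurrence graph
        # to obtain a set of correlated action classifiers.
        h = F.relu(adjacency @ self.gcn1(label_embeddings))
        classifiers = adjacency @ self.gcn2(h)                    # (num_classes, feat_dim)

        # Attention: down-weight likely background frames when pooling.
        attn = torch.softmax(self.attention(frame_feats), dim=1)  # (batch, T, 1)
        video_repr = (attn * frame_feats).sum(dim=1)              # (batch, feat_dim)

        # Video-level classification via inner product with the learned classifiers.
        logits = video_repr @ classifiers.t()                     # (batch, num_classes)
        return logits, attn


if __name__ == "__main__":
    # Toy usage with random tensors standing in for real features and embeddings.
    model = ActionRelationalGraphSketch(num_classes=20)
    frames = torch.randn(2, 100, 2048)           # 2 videos, 100 snippets each
    labels = torch.randn(20, 300)                # e.g. word vectors of the 20 labels
    adj = torch.softmax(torch.randn(20, 20), 1)  # stand-in normalized co-occurrence matrix
    logits, attn = model(frames, labels, adj)
    print(logits.shape, attn.shape)              # (2, 20) and (2, 100, 1)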
