Two-Pathway Transformer Network For Video Action Recognition
Bo Jiang, Jiahong Yu, Lei Zhou, Kailin Wu, Yang Yang
SPS
Length: 00:06:19
Traditional two-stream neural networks have shown that both appearance and motion information are important for video action recognition. However, they fuse the two streams by naively averaging their scores at the end of the framework, which neglects the underlying relationship between the two kinds of information. In this paper, we propose a two-pathway transformer network that uses memory-based attention to explore this relationship and thereby further improve classification performance. Specifically, a transformer-based decoder takes one pathway's features as the query and the other's as the key and value. A similarity matrix is computed from the query and key and is then used to select relevant information from the value, which enhances the query for the final classification task. Experiments demonstrate that the proposed method outperforms existing late-fusion strategies used in two-stream methods.
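The query/key/value interaction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the scaled dot-product similarity, the residual addition, the absence of learned projection layers, and all names (`cross_attention`, `appearance`, `motion`) and feature shapes are assumptions made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, memory_feats):
    """Enhance one pathway's features with information from the other.

    query_feats:  (T_q, d) features from the pathway used as the query.
    memory_feats: (T_m, d) features from the pathway used as key and value.
    Assumption: scaled dot-product similarity, no learned projections.
    """
    d = query_feats.shape[-1]
    # Similarity matrix between every query step and every key step.
    scores = query_feats @ memory_feats.T / np.sqrt(d)  # (T_q, T_m)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    # Select relevant information from the value (here the same as the key).
    selected = weights @ memory_feats                    # (T_q, d)
    # Assumption: the selected information is added residually to the query.
    enhanced = query_feats + selected
    return enhanced, weights


# Hypothetical per-frame features: 8 time steps, 16 channels per pathway.
rng = np.random.default_rng(0)
appearance = rng.standard_normal((8, 16))
motion = rng.standard_normal((8, 16))

# Appearance acts as the query; motion supplies the key and value.
enhanced, weights = cross_attention(appearance, motion)
```

Swapping the two arguments gives the symmetric direction, in which motion features are enhanced by appearance features. A late-fusion baseline, by contrast, would only average the two streams' class scores and never compute `weights`.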