Monocular 3D Human Pose Estimation Based on Global Temporal-Attentive and Joints-Attention in Video
Ruhan He (Wuhan Textile University); Shanshan Xiang (Wuhan Textile University); Tao Peng (Wuhan Textile University); Yongsheng Yu (Wuhan University of Technology)
Learning to capture human motion is essential to 3D human pose and shape estimation from monocular video, which is widely used in many 3D applications. However, existing methods mainly rely on recurrent or convolutional operations to model such temporal information, which limits their ability to capture non-local contextual relations of human motion and ignores human joint hierarchies. To address this problem, we propose a Global Temporal-Attentive and Joints-Attention network (GTAJA-Net). This method introduces a Global Attention Feature Integration (GAFI) module and a Motion Tree Fusion Decoder (MTFD) module on top of the temporally consistent mesh recovery system (TCMR). GAFI integrates a collection of temporal features to obtain final temporal features carrying spatial information, which enhances temporal correlation and refines the features of the current frame. Meanwhile, MTFD models joint-level attention: it treats pose estimation as a top-down hierarchical process analogous to the SMPL kinematic tree. Though conceptually simple, our GTAJA-Net outperforms state-of-the-art methods on the 3DPW, MPI-INF-3DHP, and Human3.6M benchmark datasets. Our code is available at https://github.com/xiangcece/GTAJA-Net.
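To make the two components concrete, below is a minimal PyTorch sketch of a GAFI-style global temporal attention block that refines the current (middle) frame's feature by attending over all frames of a clip. This is an illustrative sketch under our own assumptions, not the authors' released code; the class name `GlobalTemporalAttention` and parameters such as `feat_dim` and `num_heads` are hypothetical.

```python
import torch
import torch.nn as nn

class GlobalTemporalAttention(nn.Module):
    """Hypothetical GAFI-style block: attend over all frame features
    of a clip to refine the middle-frame feature (a sketch, not the
    authors' implementation)."""
    def __init__(self, feat_dim=2048, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, seq_len, feat_dim) per-frame backbone features
        mid = frame_feats.shape[1] // 2
        query = frame_feats[:, mid:mid + 1]            # current (middle) frame
        attended, _ = self.attn(query, frame_feats, frame_feats)
        # residual refinement of the current-frame feature
        return self.norm(query + attended).squeeze(1)

feats = torch.randn(2, 16, 2048)      # 2 clips of 16 frames each
refined = GlobalTemporalAttention()(feats)
print(refined.shape)                  # torch.Size([2, 2048])
```

Similarly, the top-down hierarchical decoding that MTFD performs over the SMPL kinematic tree can be sketched as a per-joint head whose input is conditioned on its parent joint's prediction. The parent indices below are the standard SMPL 24-joint kinematic tree; `joint_dim=6` assumes a 6D rotation representation per joint, and all other names are our assumptions.

```python
import torch
import torch.nn as nn

# Standard SMPL parent indices for the 24-joint kinematic tree (root = pelvis)
SMPL_PARENTS = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9,
                12, 13, 14, 16, 17, 18, 19, 20, 21]

class KinematicTreeDecoder(nn.Module):
    """Hypothetical MTFD-style decoder: predict each joint top-down,
    conditioning every child on its parent's prediction."""
    def __init__(self, feat_dim=2048, joint_dim=6):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim + (0 if p < 0 else joint_dim), joint_dim)
            for p in SMPL_PARENTS
        )

    def forward(self, feat):
        # feat: (batch, feat_dim) refined temporal feature
        preds = [None] * len(SMPL_PARENTS)
        for j, p in enumerate(SMPL_PARENTS):
            inp = feat if p < 0 else torch.cat([feat, preds[p]], dim=-1)
            preds[j] = self.heads[j](inp)
        return torch.stack(preds, dim=1)   # (batch, 24, joint_dim)

pose = KinematicTreeDecoder()(torch.randn(2, 2048))
print(pose.shape)                          # torch.Size([2, 24, 6])
```

Because `SMPL_PARENTS` is topologically ordered (every parent index precedes its children), a single forward pass over the joint list suffices for the top-down traversal.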