Efficient and Accurate Skeleton-Based Two-Person Interaction Recognition Using Inter- and Intra-Body Graphs

Yoshiki Ito, Quan Kong, Kenichi Morita, Tomoaki Yoshinaga

Length: 00:12:10

19 Oct 2022

Video-to-text (VTT) is the task of automatically generating descriptions for short audio-visual video clips. It can, for example, help visually impaired people understand the scenes shown in a YouTube video. Transformer architectures have shown strong performance in both machine translation and image captioning. In this work, we transfer promising approaches from image captioning and video processing to VTT and develop a straightforward Transformer architecture. We then extend this Transformer with a novel method for synchronizing audio and video features, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset and improve the CIDEr and BLEU-4 scores by 21.72 and 8.38 points, respectively, compared to a vanilla Transformer network, and we achieve state-of-the-art results on the MSR-VTT and MSVD datasets. In addition, our novel FPE increases the CIDEr score by a relative 8.6%.

Tags:

International Conference on Image Processing

IEEE ICIP 2022

icip

Value-Added Bundle(s) Including this Product

ICIP 2022, October 16-19, 2022, Bordeaux, France - Presentation Videos Product Bundle
