  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:12:10
19 Oct 2022

Video-to-text (VTT) is the task of automatically generating descriptions for short audio-visual video clips. It can, for example, help visually impaired people understand the scenes shown in a YouTube video. Transformer architectures have shown great performance in both machine translation and image captioning. In this work, we transfer promising approaches from image captioning and video processing to VTT and develop a straightforward Transformer architecture. We then extend this Transformer with a novel way of synchronizing audio and video features, which we call Fractional Positional Encoding (FPE). We run multiple experiments on the VATEX dataset, improving the CIDEr and BLEU-4 scores by 21.72 and 8.38 points, respectively, compared to a vanilla Transformer network, and achieve state-of-the-art results on the MSR-VTT and MSVD datasets. Our novel FPE alone increases the CIDEr score by a relative 8.6%.
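
The abstract does not spell out how FPE is computed, so the following is only a minimal sketch of one plausible reading: video features keep integer positions, audio features are rescaled onto the same index axis so they receive fractional positions, and a standard sinusoidal encoding is evaluated at those positions. The function names `fractional_positions` and `sinusoidal_encoding`, the linear rescaling, and the choice of sinusoidal encoding are all assumptions made for illustration, not details confirmed by the source.

```python
import math


def fractional_positions(n_video, n_audio):
    """Place audio indices on the video index axis (assumed scheme).

    Video features keep integer positions 0..n_video-1. Audio features
    are linearly rescaled so that both streams span the same clip.
    """
    video_pos = [float(i) for i in range(n_video)]
    scale = n_video / n_audio  # video steps per audio step
    audio_pos = [j * scale for j in range(n_audio)]
    return video_pos, audio_pos


def sinusoidal_encoding(position, d_model):
    """Standard sinusoidal encoding, evaluated at a possibly fractional position."""
    enc = []
    for k in range(0, d_model, 2):
        angle = position / (10000 ** (k / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]  # trim in case d_model is odd


# Hypothetical clip: 4 video features and 10 audio features.
video_pos, audio_pos = fractional_positions(4, 10)
video_enc = [sinusoidal_encoding(p, 8) for p in video_pos]
audio_enc = [sinusoidal_encoding(p, 8) for p in audio_pos]
```

Under this reading, an audio feature and a video feature that cover the same moment in the clip receive (nearly) the same positional encoding, which is what would let the attention layers relate the two streams in time.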
