09 May 2022

Attention supervision encourages grounded video description models (GVDMs) to focus on the relevant visual content when generating words, thereby improving their description performance. However, existing GVDMs often fail to focus on small but informative regions, because the prevailing intersection-over-union (IoU) based attention groundtruth sampling method labels these regions as negative. Moreover, the prevailing attention loss functions force GVDMs to attend equally to all sampled regions when generating words, which can make it difficult for the model to attend to the most informative regions and thus degrades the quality of the generated sentences. To alleviate these problems, we propose an informative attention supervision method consisting of a novel attention groundtruth sampling method and a group-based weak grounding supervision. Specifically, our attention groundtruth sampling method captures small proposal regions that overlap with the entity boxes, and the proposed grounding supervision allows GVDMs to dynamically focus on the most informative attention regions rather than all of them. Our approach yields competitive results on the ActivityNet Entities dataset without bells and whistles, surpassing previous methods without increasing inference cost.
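To illustrate why IoU-based sampling discards small informative regions, consider the minimal sketch below. It contrasts a standard IoU threshold with a coverage-style criterion (intersection over the proposal's own area) that keeps small proposals lying inside a large entity box. The function names and the 0.5 threshold are illustrative assumptions, not the paper's exact formulation.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def coverage(proposal, entity):
    # Intersection over the *proposal's* area: a small proposal fully
    # inside the entity box scores 1.0 regardless of the size gap.
    x1 = max(proposal[0], entity[0])
    y1 = max(proposal[1], entity[1])
    x2 = min(proposal[2], entity[2])
    y2 = min(proposal[3], entity[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_p = (proposal[2] - proposal[0]) * (proposal[3] - proposal[1])
    return inter / area_p

# A small proposal entirely inside a large entity box:
proposal = (10, 10, 20, 20)   # 10x10 region
entity   = (0, 0, 100, 100)   # 100x100 entity box

# IoU is tiny (0.01), so an IoU >= 0.5 rule labels the proposal negative,
# even though it lies on the entity. A coverage criterion keeps it positive.
print(iou(proposal, entity))       # 0.01 -> negative under IoU sampling
print(coverage(proposal, entity))  # 1.0  -> positive under overlap-based sampling
```

This is the failure mode the abstract attributes to IoU-based groundtruth sampling: the informative small region is scored by how little of the large entity box it covers, not by whether it depicts the entity.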