LEARNING TO LOCATE VISUAL ANSWER IN VIDEO CORPUS USING QUESTION

Bin Li (Hunan University); Yixuan Weng (CASIA); Bin Sun (Hunan University); Shutao Li (Hunan University)

06 Jun 2023

We introduce a new task, video corpus visual answer localization (VCVAL), which aims to locate the visual answer to a natural language question in a large collection of untrimmed instructional videos. The task requires a range of skills: vision-language interaction, video retrieval, passage comprehension, and visual answer localization. We propose a cross-modal contrastive global-span (CCGS) method for VCVAL that jointly trains the video corpus retrieval and visual answer localization subtasks with the global-span matrix. We reconstruct a dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental results show that the proposed method outperforms other competitive methods on both the video corpus retrieval and visual answer localization subtasks. We also report detailed analyses of extensive experiments, which offer a new path toward understanding instructional videos and motivate further research.

Tags:

Multimedia in healthcare, education, art, and social sciences

IEEE ICASSP 2023, 4-10 June 2023, Greece. Virtual and In-Person Conference - Presentation Videos Product Bundle
