Sunjae Yoon (KAIST); Ji Woo Hong (KAIST); SooHwan Eom (KAIST); Hee Suk Yoon (KAIST); Eunseop Yoon (KAIST); Daehyeok Kim (KAIST); Junyeong Kim (Chung-Ang University); Chanwoo Kim (Samsung Electronics); Chang D. Yoo (KAIST)
09 Jun 2023
Video Corpus Moment Retrieval aims to retrieve, from a large video corpus, the temporal video moment relevant to a given language query. Existing systems tend to rely on a retrieval bias as a shortcut, which prevents them from accurately learning vision-language associations. This retrieval bias consists of spurious correlations between queries and scenes: because a dataset's annotations are dominated by certain query-scene bindings, systems tend to retrieve scenes that are incorrectly correlated with a given query. To address this, we present Counterfactual Two-stage Debiasing Learning (CTDL), which (1) incorporates a counterfactual bias network that intentionally learns the retrieval bias by being given a shortcut to the spurious correlation between keywords and scenes, and (2) performs two-stage debiasing learning that mitigates the bias by contrasting factual retrievals with counterfactually biased retrievals. Extensive experiments demonstrate the effectiveness of the CTDL paradigm.
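The abstract does not specify CTDL's architecture or losses, so the following is only a toy illustration of the general idea of contrasting a factual retrieval with a counterfactually biased one. It assumes a hypothetical bias-only scorer (`keyword_bias_scores`) that sees nothing but query keywords and their co-occurrence with scene labels in training annotations, and it swaps in a simple log-probability subtraction (`counterfactual_debias`, with a hypothetical weight `alpha`) as the contrasting rule. All names, scores, and annotations are made up for illustration; this is not the authors' two-stage procedure.

```python
import math
from collections import Counter
from typing import List, Tuple


def log_softmax(scores: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of candidate scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def keyword_bias_scores(
    annotations: List[Tuple[str, str]],
    query: str,
    scenes: List[str],
    smoothing: float = 1.0,
) -> List[float]:
    """Hypothetical bias-only scorer: rates each candidate scene purely by how
    often the query's keywords co-occurred with that scene label in the
    training annotations, ignoring the visual content entirely."""
    counts = Counter()
    for text, scene in annotations:
        for word in set(text.lower().split()):
            counts[(word, scene)] += 1
    keywords = set(query.lower().split())
    return [
        sum(math.log(counts[(w, s)] + smoothing) for w in keywords)
        for s in scenes
    ]


def counterfactual_debias(
    factual: List[float], biased: List[float], alpha: float = 1.0
) -> List[float]:
    """Contrast factual retrieval scores with bias-only scores by subtracting
    alpha times the bias model's log-probabilities (an assumed rule)."""
    f = log_softmax(factual)
    b = log_softmax(biased)
    return [fi - alpha * bi for fi, bi in zip(f, b)]


# Made-up training annotations in which "door" is almost always bound to "hallway".
annotations = [
    ("a man opens the door", "hallway"),
    ("a woman closes the door", "hallway"),
    ("a man opens the door", "hallway"),
    ("a man cooks pasta", "kitchen"),
]
scenes = ["hallway", "kitchen"]
query = "man opens door"
factual = [2.0, 1.8]  # made-up scores from a full query-video model
biased = keyword_bias_scores(annotations, query, scenes)
debiased = counterfactual_debias(factual, biased)
print(scenes[max(range(len(scenes)), key=lambda i: debiased[i])])
```

Here the factual scores slightly favor "hallway", but most of that preference is explained by the keyword-scene co-occurrence alone, so subtracting the bias-only log-probabilities flips the ranking toward "kitchen". Setting `alpha` to 0 recovers the factual ranking; larger values remove more of the keyword shortcut at the risk of discarding genuine signal.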