C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko (MIT CSAIL); Yung-Sung Chuang (MIT); Nina Shvetsova (Goethe University Frankfurt); Samuel Thomas (IBM Research AI); Rogerio Feris (MIT-IBM Watson AI Lab, IBM Research); Brian Kingsbury (IBM Research); Leonid Karlinsky (IBM Research); David Harwath (The University of Texas at Austin); Hilde Kuehne (Goethe University Frankfurt); James Glass (Massachusetts Institute of Technology)
Multilingual text-video retrieval methods have improved significantly in recent years, but performance for languages other than English still lags behind. We propose a Cross-Lingual Cross-Modal Knowledge Distillation (C2KD) method to improve multilingual text-video retrieval. Inspired by the fact that text-video retrieval in English outperforms retrieval in other languages, we train a student model using input text in different languages to match the cross-modal predictions of teacher models using input text in English. We propose a cross-entropy-based objective that forces the distribution over the student's text-video similarity scores to match that of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets, such as Multi-MSRVTT and VATEX. We also conduct an analysis of the effectiveness of different multilingual text models as teachers.
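To make the distillation objective concrete, the sketch below shows one plausible PyTorch rendering of a cross-entropy loss between teacher (English) and student (non-English) text-video similarity distributions computed over a batch. The function name c2kd_loss, the embedding shapes, the batch-wise similarity construction, and the temperature value are assumptions for illustration, not the paper's exact implementation.

```python
# A minimal sketch of a cross-entropy distillation objective over text-video
# similarity scores, assuming parallel English / non-English captions and a
# shared video encoder. All names and hyperparameters here are hypothetical.
import torch
import torch.nn.functional as F

def c2kd_loss(student_text_emb: torch.Tensor,
              teacher_text_emb: torch.Tensor,
              video_emb: torch.Tensor,
              temperature: float = 0.05) -> torch.Tensor:
    """Cross-entropy between teacher and student similarity distributions.

    student_text_emb: (B, D) embeddings of non-English captions.
    teacher_text_emb: (B, D) embeddings of the parallel English captions.
    video_emb:        (B, D) embeddings of the corresponding videos.
    """
    # L2-normalize so dot products are cosine similarities.
    s = F.normalize(student_text_emb, dim=-1)
    t = F.normalize(teacher_text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)

    # Text-to-video similarity matrices, shape (B, B): row i holds the
    # scores of caption i against every video in the batch.
    student_sim = s @ v.T / temperature
    teacher_sim = t @ v.T / temperature

    # The teacher's softmax over videos is a soft target; cross-entropy
    # pulls the student's score distribution toward the teacher's.
    teacher_probs = teacher_sim.softmax(dim=-1).detach()
    loss = -(teacher_probs * student_sim.log_softmax(dim=-1)).sum(dim=-1).mean()
    return loss
```

Detaching the teacher distribution keeps gradients from flowing into the teacher, so only the student's text encoder is updated by this loss.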