VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG

Tong Ye, Shijing Si, Jianzong Wang, Ning Cheng, Jing Xiao, Rui Wang

09 May 2022

The visual dialog task trains an agent to answer multi-turn questions about an image, which requires a deep understanding of the interactions between the image and the dialog history. Existing research tends to employ modality-specific modules to model these interactions, which can be cumbersome to use. To address this issue, we propose a unified framework for image-text joint embedding, named VU-BERT, and are the first to apply patch projection to obtain visual embeddings in visual dialog tasks, simplifying the model. The model is trained on two tasks: masked language modeling and next utterance retrieval. These tasks help it learn visual concepts, utterance dependencies, and the relationships between the two modalities. Our VU-BERT achieves competitive performance (an NDCG score of 0.7287) on the VisDial v1.0 dataset.
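For readers unfamiliar with patch projection, the sketch below shows the ViT-style approach the abstract alludes to: an image is split into fixed-size patches, each flattened and linearly projected into the same embedding space as the text tokens. The class name, patch size, and hidden dimension are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class PatchProjection(nn.Module):
    """ViT-style patch projection (illustrative, assumed hyperparameters):
    split an image into fixed-size patches and linearly project each
    patch into the joint embedding dimension."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch
        # and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (batch, channels, height, width)
        x = self.proj(x)                  # (B, embed_dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim)
        return x

# Usage: embed a batch of images; the resulting patch tokens can be
# concatenated with text token embeddings for joint encoding.
imgs = torch.randn(2, 3, 224, 224)
patches = PatchProjection()(imgs)
print(patches.shape)  # torch.Size([2, 196, 768])
```

Replacing a pretrained object detector with this single projection layer is what lets a unified transformer consume image and text tokens through one encoder, which is the simplification the abstract claims.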