Multi-Modal Transformer for RGB-D Salient Object Detection
Peipei Song, Jing Zhang, Piotr Koniusz, Nick Barnes
In this paper, we propose a multi-modal transformer model for end-to-end training of video-grounded dialogue generation. In particular, LayerScale-regularized spatio-temporal self-attention blocks are first introduced, enabling flexible end-to-end training from both video and image data without extracting offline visual features. Further, a pre-trained generative language architecture, BART, is employed to encode the different modalities and perform dialogue generation. Extensive experiments on the Audio-Visual Scene-Aware Dialog (AVSD) dataset demonstrate its effectiveness and superiority over state-of-the-art methods.
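As a rough illustration of the LayerScale regularization mentioned above, the sketch below wraps a self-attention residual branch with a learnable per-channel scale initialized near zero, so the block starts close to an identity mapping. The class and parameter names (`LayerScaleAttentionBlock`, `dim`, `num_heads`, `init_value`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of a LayerScale-regularized self-attention block.
# Hypothetical names and hyper-parameters; not the paper's actual code.
import torch
import torch.nn as nn


class LayerScaleAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, init_value: float = 1e-4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # LayerScale: learnable per-channel scaling of the residual branch,
        # initialized to a small value so training starts near identity.
        self.gamma = nn.Parameter(init_value * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); tokens may flatten space and time when
        # attending jointly over video-frame or image patches.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h)
        return x + self.gamma * attn_out
```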