Multi-Modal Transformer for RGB-D Salient Object Detection

Peipei Song, Jing Zhang, Piotr Koniusz, Nick Barnes

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:09:37
03 Oct 2022

In this paper, we propose a multi-modal transformer model for end-to-end training of video-grounded dialogue generation. In particular, LayerScale-regularized spatio-temporal self-attention blocks are first introduced, enabling flexible end-to-end training from both video and image data without extracting offline visual features. Further, a pre-trained generative language architecture, BART, is employed to encode the different modalities and perform dialogue generation. Extensive experiments on the Audio-Visual Scene-Aware Dialog (AVSD) dataset demonstrate the model's effectiveness and superiority over state-of-the-art methods.
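The LayerScale regularization mentioned above multiplies each residual branch's output by a learnable per-channel diagonal, initialized near zero so every block starts close to an identity mapping, which stabilizes deep transformer training. A minimal NumPy sketch of this idea (the class name, dimensions, and init value are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

class LayerScale:
    """Per-channel scaling of a residual-branch output (LayerScale).
    Hypothetical minimal sketch; the paper applies it inside
    spatio-temporal self-attention blocks with learnable gamma."""

    def __init__(self, dim, init_value=1e-4):
        # gamma starts near zero so the block initially contributes
        # almost nothing to the residual stream
        self.gamma = np.full(dim, init_value)

    def __call__(self, x):
        # x: (..., dim) branch output; scale each channel independently
        return x * self.gamma

# toy usage: (batch, tokens, channels) attention output
block_out = np.ones((2, 4, 8))
scaled = LayerScale(8)(block_out)
```

In training, `gamma` would be a learnable parameter updated jointly with the rest of the network.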
