Multi-Modal Transformer for RGB-D Salient Object Detection

Peipei Song, Jing Zhang, Piotr Koniusz, Nick Barnes

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:09:37
03 Oct 2022

In this paper, we propose a multi-modal transformer model for end-to-end training of video-grounded dialogue generation. In particular, LayerScale-regularized spatio-temporal self-attention blocks are first introduced, enabling flexible end-to-end training from both video and image data without extracting offline visual features. Further, a pre-trained generative language architecture, BART, is employed to encode the different modalities and perform dialogue generation. Extensive experiments on the Audio-Visual Scene-Aware Dialog (AVSD) dataset demonstrate the model's effectiveness and superiority over state-of-the-art methods.
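The LayerScale regularization mentioned above multiplies each residual branch's output by a learnable per-channel diagonal, initialized near zero so every block starts close to an identity mapping, which stabilizes deep transformer training. A minimal NumPy sketch of this idea (the class name, dimensions, and init value are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

class LayerScale:
    """Per-channel scaling of a residual-branch output (LayerScale).
    Hypothetical minimal sketch; the paper applies it inside
    spatio-temporal self-attention blocks with learnable gamma."""

    def __init__(self, dim, init_value=1e-4):
        # gamma starts near zero so the block initially contributes
        # almost nothing to the residual stream
        self.gamma = np.full(dim, init_value)

    def __call__(self, x):
        # x: (..., dim) branch output; scale each channel independently
        return x * self.gamma

# toy usage: (batch, tokens, channels) attention output
block_out = np.ones((2, 4, 8))
scaled = LayerScale(8)(block_out)
```

In training, `gamma` would be a learnable parameter updated jointly with the rest of the network.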
