Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition

Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:08:32
04 Oct 2022

With the considerable advancement of remote sensing technology and computer vision, automatic scene understanding for very high-resolution (VHR) aerial imagery has become a necessary research topic. Semantic segmentation of VHR imagery is an important task in which context information plays a crucial role. Adequate feature delineation is difficult because of the high class imbalance in remotely sensed data. In this work, we propose a variant of the encoder-decoder architecture that incorporates residual attentive skip connections. We add a multi-context block to each encoder unit to capture multi-scale and multi-context features, and we use dense connections for effective feature extraction. A comprehensive set of experiments reveals that the proposed scheme outperforms recently published work by 3% in overall accuracy and F1 score on the ISPRS Vaihingen and ISPRS Potsdam benchmark datasets.
