Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition
Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai
SPS
With the considerable advancement of remote sensing technology and computer vision, automatic scene understanding for very high-resolution (VHR) aerial imagery has become an essential research topic. Semantic segmentation of VHR imagery is an important task in which context information plays a crucial role. Adequate feature delineation is difficult due to the high class imbalance in remotely sensed data. In this work, we propose a variant of an encoder-decoder architecture that incorporates residual attentive skip connections. We add a multi-context block to each encoder unit to capture multi-scale and multi-context features, and use dense connections for effective feature extraction. A comprehensive set of experiments reveals that the proposed scheme outperforms recently published work by 3% in overall accuracy and F1 score on the ISPRS Vaihingen and ISPRS Potsdam benchmark datasets.
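The abstract names a residual attentive skip connection but does not spell out its formulation. Below is a minimal NumPy sketch of one common way such a connection can be realized: a sigmoid gate computed from the encoder and decoder features modulates the skipped encoder features, with an identity (residual) path added back. The function name `residual_attentive_skip` and the scalar gate weights `w_e` and `w_d` are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_attentive_skip(encoder_feat, decoder_feat, w_e=0.5, w_d=0.5):
    """Hypothetical attentive skip connection (sketch, not the paper's exact block).

    A sigmoid gate in (0, 1) is formed from both feature maps; the gated
    encoder features are then added to an ungated residual copy.
    """
    gate = sigmoid(w_e * encoder_feat + w_d * decoder_feat)
    return gate * encoder_feat + encoder_feat  # attended path + identity path

# Toy feature maps standing in for one encoder/decoder stage
rng = np.random.default_rng(0)
enc = rng.standard_normal((4, 4))
dec = rng.standard_normal((4, 4))
out = residual_attentive_skip(enc, dec)
print(out.shape)
```

Because the gate lies in (0, 1), the output stays between one and two times the encoder features, so the skip path can never be fully suppressed; this is one plausible reading of "residual attentive," under the stated assumptions.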