Rethinking Transformers for Semantic Segmentation of Remote Sensing Images

Bibliographic Details
Published in:	IEEE Transactions on Geoscience and Remote Sensing, Vol. 61, pp. 1-15
Main Authors:	Liu, Yuheng; Zhang, Yifan; Wang, Ye; Mei, Shaohui
Format:	Journal Article
Language:	English
Published:	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023
Subjects:	Ablation; Aggregation; Artificial neural networks; Coders; Context; Current transformers; Decoding; Encoder–decoder structure; Feature extraction; global-local transformer; Image acquisition; Image processing; Image segmentation; Information processing; Modelling; Modules; Neural networks; Remote sensing; remote sensing (RS); Representations; Semantic segmentation; Semantics; Task analysis; Visualization
ISSN:	0196-2892, 1558-0644

Description
Summary:	Transformers have been widely applied in image processing tasks as a substitute for convolutional neural networks (CNNs) for feature extraction, owing to their strength in global context modeling and their flexibility in model generalization. However, existing transformer-based methods for semantic segmentation of remote sensing (RS) images still have several limitations, which fall into two main aspects: 1) the transformer encoder is generally combined with a CNN-based decoder, leading to inconsistent feature representations; and 2) the strategies for using global and local context information are not sufficiently effective. This article therefore proposes a global-local transformer segmentor (GLOTS) framework for the semantic segmentation of RS images, which acquires consistent feature representations by adopting transformers for both encoding and decoding. A masked image modeling (MIM) pretrained transformer encoder learns semantic-rich representations of the input images, and a multiscale global-local transformer decoder is designed to fully exploit the global and local features. Specifically, the transformer decoder uses a feature separation-aggregation module (FSAM) to make adequate use of features at different scales, and adopts a global-local attention module (GLAM), containing a global attention block (GAB) and a local attention block (LAB), to capture global and local context information, respectively. Furthermore, a learnable progressive upsampling strategy (LPUS) is proposed to restore the resolution progressively, which can flexibly recover fine-grained details during upsampling. Experimental results on three benchmark RS datasets demonstrate that the proposed GLOTS achieves better performance compared with some state-of-the-art methods, and the superiority of the proposed framework is also verified by ablation studies. The code will be available at https://github.com/lyhnsn/GLOTS .
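The distinction the summary draws between global and local context can be illustrated with a minimal sketch. This is not the GLOTS implementation: the function names (`attend`, `global_attention`, `local_attention`), the 1-D token layout, the non-overlapping window scheme, and the absence of learned projections are all assumptions made for illustration only. Global attention lets every token attend to all tokens, whereas local attention restricts each token to its own window.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def global_attention(tokens):
    """Every token attends to every token (hypothetical stand-in for a global block)."""
    return attend(tokens, tokens, tokens)


def local_attention(tokens, window):
    """Tokens attend only within non-overlapping windows (hypothetical local block)."""
    outputs = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        outputs.extend(attend(chunk, chunk, chunk))
    return outputs


# Illustrative input: four 2-D tokens (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
global_out = global_attention(tokens)
local_out = local_attention(tokens, window=2)
```

In this sketch, a window covering the whole sequence makes the local variant coincide with the global one, while a window of one token leaves each token unchanged. A real decoder would add learned query, key, and value projections and operate on 2-D feature maps.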
DOI:	10.1109/TGRS.2023.3302024