Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing

Detailed Bibliography
Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, Volume 18, pp. 1-17
Main authors: Wang, Yi; Hernandez, Hugo Hernandez; Albrecht, Conrad M.; Zhu, Xiao Xiang
Format: Journal Article
Language: English
Publication details: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2025
ISSN: 1939-1404, 2151-1535
Description
Summary: Self-supervised learning guided by masked image modeling, such as Masked AutoEncoder (MAE), has attracted wide attention for pretraining vision transformers in remote sensing. However, MAE tends to excessively focus on pixel details, limiting the model's capacity for semantic understanding, particularly for noisy Synthetic Aperture Radar (SAR) images. In this paper, we explore spectral and spatial remote sensing image features as improved MAE-reconstruction targets. We first conduct a study on reconstructing various image features, all performing comparably well or better than raw pixels. Based on such observations, we propose Feature Guided Masked Autoencoder (FG-MAE): reconstructing a combination of Histograms of Oriented Gradients (HOG) and Normalized Difference Indices (NDI) for multispectral images, and reconstructing HOG for SAR images. Experimental results on three downstream tasks illustrate the effectiveness of FG-MAE with a particular boost for SAR imagery (e.g., up to 5% better than MAE on EuroSAT-SAR). Furthermore, we demonstrate the well-inherited scalability of FG-MAE and release a first series of pretrained vision transformers for medium-resolution SAR and multispectral images.
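To make the two reconstruction targets named in the abstract concrete, the following is a minimal Python sketch of how per-patch HOG descriptors and a Normalized Difference Index map could be computed as feature targets in place of raw pixels. It is an illustration under stated assumptions, not the authors' implementation: the band indices (Sentinel-2-like ordering with Red at index 3 and NIR at index 7), the 16-pixel patch size, and the HOG parameters are all hypothetical choices made here for the example.

```python
# Illustrative sketch of feature-space reconstruction targets (HOG and NDI)
# for an MAE-style objective. Band ordering, patch size, and HOG settings
# are assumptions for this example, not the paper's exact configuration.
import numpy as np
from skimage.feature import hog


def ndi(band_a: np.ndarray, band_b: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized Difference Index, e.g. NDVI when (band_a, band_b) = (NIR, Red)."""
    return (band_a - band_b) / (band_a + band_b + eps)


def hog_target(image_2d: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Per-patch HOG descriptors for a single channel, shape (num_patches, 9)."""
    feats = hog(
        image_2d,
        orientations=9,
        pixels_per_cell=(patch_size, patch_size),
        cells_per_block=(1, 1),
        feature_vector=False,
    )
    # feats has shape (H/ps, W/ps, 1, 1, 9); flatten to (num_patches, 9)
    return feats.reshape(-1, feats.shape[-1])


# Toy multispectral tile: 13 bands, 224x224 pixels (random data for the demo).
ms = np.random.rand(13, 224, 224).astype(np.float32)

ndvi_target = ndi(ms[7], ms[3])   # spectral target: one NDI map (here NDVI)
hog_targets = hog_target(ms[3])   # spatial target: per-patch HOG for one band
print(ndvi_target.shape, hog_targets.shape)  # (224, 224) (196, 9)
```

In an FG-MAE-style pipeline these feature maps would replace raw pixel values as the decoder's regression targets for masked patches; for SAR imagery only the HOG branch would apply, consistent with the abstract.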
DOI: 10.1109/JSTARS.2024.3493237