MDSTA: masked diffusion spatio-temporal autoencoder for multimodal remote sensing image classification



Bibliographic details
Published in: International Journal of Remote Sensing, Vol. 46, No. 11, pp. 4215–4240
Main authors: Yue, Zongqin; Xu, Jindong; Li, Ziyi; Xing, Haihua; Cheng, Xiang
Format: Journal Article
Language: English
Published: Taylor & Francis, 3 June 2025
ISSN: 0143-1161, 1366-5901
DOI: 10.1080/01431161.2025.2496532
Description
Summary: Deep learning methods have significantly advanced multimodal remote sensing data classification in recent years. However, challenges in data acquisition frequently lead to missing modalities, which substantially degrade the performance of these models. Diffusion models have shown tremendous potential in generative tasks, demonstrating a superior ability to model complex data distributions. However, in the remote sensing domain, which spans diverse modalities, each modality carries unique information, and diffusion models may struggle to recover critical information when one modality is missing, resulting in unsatisfactory performance. To address this, we propose the Masked Diffusion Spatio-Temporal Autoencoder (MDSTA) network for the joint classification of remote sensing data under arbitrary modality combinations. MDSTA consists of three main components: a Conditional Masked Diffusion Process (CMDP), a Reverse Diffusion Reconstruction Process (RDRP), and an Attention Multi-Layer Perceptron (MLP). The CMDP progressively adds noise to the remote sensing data to prepare for subsequent denoising and reconstruction. The RDRP extracts features from multimodal data through our Spatio-Temporal Fusion (STF) Encoder, mapping them into a shared-parameter modality-mixed space to capture features shared across modalities. The Masked Reconstruction (MR) Decoder then reconstructs each modality independently from these features, which helps the model learn the unique characteristics of each modality. An attention MLP then fuses the multimodal classification tokens to produce the final classification result. Furthermore, we introduce masked training in the conditional masked diffusion block to reduce memory consumption. Comprehensive experiments on four datasets show that the proposed MDSTA model outperforms leading models.
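For readers unfamiliar with the masked diffusion idea the abstract describes, the sketch below illustrates one plausible reading of the CMDP's forward step: standard DDPM-style noising, q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I), combined with random token masking so that only a subset of patch tokens is processed downstream. All names, shapes, schedules, and the masking scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear DDPM noise schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def masked_forward_diffusion(x0, t, alpha_bar, mask_ratio=0.75, rng=None):
    """Noise patch tokens at timestep t, then keep only a random subset.

    x0: (num_tokens, dim) patch embeddings of one modality.
    Returns the kept noised tokens, the matching noise targets, and the
    boolean keep-mask (so positions can be restored after decoding).
    """
    rng = rng or np.random.default_rng()
    n = x0.shape[0]
    keep = np.zeros(n, dtype=bool)
    keep[rng.permutation(n)[: int(n * (1.0 - mask_ratio))]] = True

    # q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

    # Masked-out tokens never enter the encoder, which is where the
    # memory saving of masked training comes from.
    return xt[keep], noise[keep], keep

# Example: 196 tokens of dimension 64, noised at timestep t = 500.
abar = linear_alpha_bar()
x0 = np.random.default_rng(0).standard_normal((196, 64))
xt, eps, keep = masked_forward_diffusion(
    x0, t=500, alpha_bar=abar, rng=np.random.default_rng(1)
)
print(xt.shape, int(keep.sum()))  # (49, 64) 49 -- only 25% of tokens survive
```

Since transformer-style encoder cost grows at least linearly (and quadratically for self-attention) in the number of tokens, dropping most tokens before encoding is a plausible source of the memory savings the abstract attributes to masked training.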