Volume 47 Issue 6
Jun.  2025
Citation: DIAO Wenhui, GONG Shuo, XIN Linlin, SHEN Zhiping, SUN Chao. A Model Pre-training Method with Self-Supervised Strategies for Multimodal Remote Sensing Data[J]. Journal of Electronics & Information Technology, 2025, 47(6): 1658-1668. doi: 10.11999/JEIT241016

A Model Pre-training Method with Self-Supervised Strategies for Multimodal Remote Sensing Data

doi: 10.11999/JEIT241016 cstr: 32379.14.JEIT241016
  • Received Date: 2024-11-13
  • Rev Recd Date: 2025-05-20
  • Available Online: 2025-05-28
  • Publish Date: 2025-06-30
  • Objective  With the advancement of remote sensing and large-model technologies, self-supervised learning enables models to be trained on unlabeled remote sensing data through a mask-and-reconstruct approach. However, existing masking strategies focus primarily on spatial feature modeling and overlook spectral feature modeling, so the spectral dimension of hyperspectral data remains underexploited. To address this, this paper examines the imaging mechanisms and data characteristics of remote sensing and constructs a self-supervised pretraining foundation model that accepts multimodal remote sensing image data, providing a new approach for pretraining on such data.
  • Methods  Building on the imaging mechanisms and data characteristics of remote sensing, this paper constructs a self-supervised pretraining foundation model based on Masked AutoEncoders (MAE) that accepts Synthetic Aperture Radar (SAR), Light Detection And Ranging (LiDAR), and HyperSpectral Imaging (HSI) data. The model employs a spatial branch that randomly masks pixel blocks and reconstructs the missing pixels, and a spectral branch that randomly masks spectral channels and reconstructs the missing spectral responses. This dual-branch design enables the model to capture both the spatial and the spectral features of multimodal remote sensing image data, improving the accuracy of pixel-level land cover classification. A minimal sketch of the two masking operations is given after this list.
  • Results and Discussions  The model was evaluated on land cover classification using two publicly available datasets, the Berlin dataset and the Houston dataset. The experimental results show that the dual-channel attention mechanism extracts features from multimodal remote sensing image data more effectively. Through iterative parameter tuning, optimal hyperparameters were determined for each dataset. Compared with mainstream self-supervised learning methods such as BYOL, SimCLR, and SimCLRv2, the model improves land cover classification accuracy by 1.98% on the Berlin dataset (Table 3, Fig. 7) and by 2.49% on the Houston dataset (Table 4, Fig. 8).
  • Conclusions  This paper proposes a model for multimodal remote sensing image data classification that comprises two main components: a spatial branch and a spectral branch. The spatial branch processes the spatial information of images by masking randomly selected image patches and reconstructing the missing pixels, strengthening the model's understanding of spatial structure. The spectral branch masks randomly selected spectral channels and reconstructs the missing spectral responses, effectively exploiting the spectral dimension of hyperspectral data. Experimental results indicate that the proposed model efficiently extracts and uses both spatial and spectral information, yielding a significant improvement in classification accuracy.
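The sketch below (PyTorch) illustrates the two masking operations described in the Methods: MAE-style random masking of spatial patches and random masking of whole spectral channels. It is not the authors' released code; the function names spatial_random_mask and spectral_random_mask, the mask ratios, and the tensor shapes are assumptions for illustration only.

    # Illustrative sketch of dual-branch masking for self-supervised pretraining.
    # Assumptions: patch embeddings of shape (B, N, D) for the spatial branch and
    # an HSI cube of shape (B, C, H, W) for the spectral branch.
    import torch

    def spatial_random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
        """Randomly drop a fraction of spatial patches (MAE-style).

        Returns the kept patches, a binary mask in the original patch order
        (1 = masked, 0 = kept), and the indices needed to restore that order.
        """
        B, N, D = patches.shape
        n_keep = int(N * (1.0 - mask_ratio))

        noise = torch.rand(B, N, device=patches.device)      # one random score per patch
        ids_shuffle = torch.argsort(noise, dim=1)            # random permutation of patches
        ids_restore = torch.argsort(ids_shuffle, dim=1)      # inverse permutation

        ids_keep = ids_shuffle[:, :n_keep]
        kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

        mask = torch.ones(B, N, device=patches.device)
        mask[:, :n_keep] = 0
        mask = torch.gather(mask, 1, ids_restore)            # map back to original order
        return kept, mask, ids_restore

    def spectral_random_mask(hsi: torch.Tensor, mask_ratio: float = 0.5):
        """Randomly zero out whole spectral channels of an HSI cube.

        The spectral branch would be trained to reconstruct the responses
        of the masked channels. mask: (B, C), 1 = masked, 0 = kept.
        """
        B, C, H, W = hsi.shape
        n_keep = int(C * (1.0 - mask_ratio))
        noise = torch.rand(B, C, device=hsi.device)
        ids_shuffle = torch.argsort(noise, dim=1)
        mask = torch.ones(B, C, device=hsi.device)
        mask.scatter_(1, ids_shuffle[:, :n_keep], 0.0)       # mark randomly chosen channels as kept
        masked_hsi = hsi * (1.0 - mask).view(B, C, 1, 1)     # zero out the masked channels
        return masked_hsi, mask

    if __name__ == "__main__":
        patches = torch.randn(2, 196, 768)    # e.g. 14x14 patches, 768-dim tokens
        kept, mask, ids_restore = spatial_random_mask(patches)
        cube = torch.randn(2, 144, 64, 64)    # e.g. 144 spectral bands
        masked_cube, band_mask = spectral_random_mask(cube)
        print(kept.shape, mask.shape, masked_cube.shape, band_mask.shape)

In an MAE-style setup, each branch's encoder would see only the kept patches or channels, and a lightweight decoder would be trained to reconstruct the masked content; the masks and restore indices returned above are what such a decoder would need to place the reconstructions back in their original positions.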
  • References
    [1] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. The 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [2] LIU Quanyong, PENG Jiangtao, CHEN Na, et al. Category-specific prototype self-refinement contrastive learning for few-shot hyperspectral image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5524416. doi: 10.1109/TGRS.2023.3317077.
    [3] SHI Junfei and JIN Haiyan. Riemannian nearest-regularized subspace classification for polarimetric SAR images[J]. IEEE Geoscience and Remote Sensing Letters, 2022, 19: 4028605. doi: 10.1109/LGRS.2022.3224556.
    [4] LUO Fulin, ZHOU Tianyuan, LIU Jiamin, et al. Multiscale diff-changed feature fusion network for hyperspectral image change detection[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5502713. doi: 10.1109/TGRS.2023.3241097.
    [5] GUO Tan, WANG Ruizhi, LUO Fulin, et al. Dual-view spectral and global spatial feature fusion network for hyperspectral image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5512913. doi: 10.1109/TGRS.2023.3277467.
    [6] WU Ke, FAN Jiayuan, YE Peng, et al. Hyperspectral image classification using spectral–spatial token enhanced transformer with hash-based positional embedding[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5507016. doi: 10.1109/TGRS.2023.3258488.
    [7] LIU Guangyuan, LI Yangyang, CHEN Yanqiao, et al. Pol-NAS: A neural architecture search method with feature selection for PolSAR image classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2022, 15: 9339–9354. doi: 10.1109/JSTARS.2022.3217047.
    [8] HE Kaiming, FAN Haoqi, WU Yuxin, et al. Momentum contrast for unsupervised visual representation learning[C]. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020: 9726–9735. doi: 10.1109/CVPR42600.2020.00975.
    [9] CHEN Ting, KORNBLITH S, NOROUZI M, et al. A simple framework for contrastive learning of visual representations[C]. The 37th International Conference on Machine Learning, Vienna, Austria, 2020: 149.
    [10] HE Kaiming, CHEN Xinlei, XIE Saining, et al. Masked autoencoders are scalable vision learners[C]. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 15979–15988. doi: 10.1109/CVPR52688.2022.01553.
    [11] CONG Yezhen, KHANNA S, MENG Chenlin, et al. SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery[C]. The 36th Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 197–211.
    [12] ROY S K, DERIA A, HONG Danfeng, et al. Multimodal fusion transformer for remote sensing image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5515620. doi: 10.1109/TGRS.2023.3286826.
    [13] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[C]. The 9th International Conference on Learning Representations, 2021.
    [14] JIANG Jiarui, HUANG Wei, ZHANG Miao, et al. Unveil benign overfitting for transformer in vision: Training dynamics, convergence, and generalization[C]. The 38th Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 135464–135625.
    [15] ROUET-LEDUC B and HULBERT C. Automatic detection of methane emissions in multispectral satellite imagery using a vision transformer[J]. Nature Communications, 2024, 15(1): 3801. doi: 10.1038/s41467-024-47754-y.
    [16] YANG Bin, WANG Xuan, XING Ying, et al. Modality fusion vision transformer for hyperspectral and LiDAR data collaborative classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024, 17: 17052–17065. doi: 10.1109/JSTARS.2024.3415729.
    [17] LIU Ze, LIN Yutong, CAO Yue, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 9992–10002. doi: 10.1109/ICCV48922.2021.00986.
    [18] CHEN C F R, FAN Quanfu, and PANDA R. CrossViT: Cross-attention multi-scale vision transformer for image classification[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 347–356. doi: 10.1109/ICCV48922.2021.00041.
    [19] GRAHAM B, EL-NOUBY A, TOUVRON H, et al. LeViT: A vision transformer in ConvNet’s clothing for faster inference[C]. 2021 IEEE/CVF International Conference on Computer Vision, Montreal, Canada, 2021: 12239–12249. doi: 10.1109/ICCV48922.2021.01204.
    [20] ZHANG Yuwen, PENG Yishu, TU Bing, et al. Local information interaction transformer for hyperspectral and LiDAR data classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 1130–1143. doi: 10.1109/JSTARS.2022.3232995.
    [21] ZHANG Jinli, CHEN Ziqiang, JI Yuanfa, et al. A multi-branch feature fusion model based on convolutional neural network for hyperspectral remote sensing image classification[J]. International Journal of Advanced Computer Science and Applications (IJACSA), 2023, 14(6): 147–156. doi: 10.14569/IJACSA.2023.0140617.
    [22] WANG Jinping, LI Jun, SHI Yanli, et al. AM3Net: Adaptive mutual-learning-based multimodal data fusion network[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(8): 5411–5426. doi: 10.1109/TCSVT.2022.3148257.
    [23] GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: A new approach to self-supervised learning[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 1786.
    [24] CHEN Ting, KORNBLITH S, SWERSKY K, et al. Big self-supervised models are strong semi-supervised learners[C]. The 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2020: 1865.