XIE Wen, ZHU Chaotao, WANG Jin, MA Xiaomeng. Remote Sensing Land-Cover Classification Combining Multi-Modal and Multi-Scale Fusion with Mamba[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251303

Remote Sensing Land-Cover Classification Combining Multi-Modal and Multi-Scale Fusion with Mamba

doi: 10.11999/JEIT251303 cstr: 32379.14.JEIT251303
Funds: The National Natural Science Foundation of China (61901365, 62071379), The Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2025JC-YBQN-936), Scientific Research Program Funded by Education Department of Shaanxi Provincial Government (Program No. 25JP175), The Youth Innovation Team of Shaanxi Universities, The New Star Team of Xi’an University of Posts and Telecommunications (xyt2016-01)
  • Accepted Date: 2026-04-17
  • Rev Recd Date: 2026-04-17
  • Available Online: 2026-05-03
Objective   The rapid development of remote sensing imaging technology has generated massive and diverse data for remote sensing land-cover classification. In recent years, Mamba-based models have been applied successfully to image processing owing to their distinctive architectures and strong global modeling capabilities. Among them, multi-scale vision Mamba models handle complex spatial distributions well, which matches the characteristics of remote sensing scenes, including large scale variations and complex orientations of ground objects. To fully exploit the advantages of Mamba models in extracting and fusing features from remote sensing data, the Mamba-based Multi-Modal and Multi-Scale Fusion Model for Remote Sensing Land-Cover Classification (M3RS) is proposed.

Methods   M3RS extracts and fuses features in three stages. First, a Multi-Scale Spatial Encoder based on Spatial Mamba extracts features from Light Detection And Ranging (LiDAR) images and Synthetic Aperture Radar (SAR) images. Because the HyperSpectral Image (HSI) has a unique data structure, a Multi-Scale Spatio-Spectral Encoder is proposed to extract its complex spatial-spectral features using Spatial Mamba and Spectral Mamba. Next, a Multi-Modal Feature Fusion Module, comprising the proposed Cross-Mamba and Channel-Concatenated Mamba, fuses the multimodal features. Cross-Mamba efficiently fuses multimodal spatial features through interaction between the state space parameters of the different modalities, while Channel-Concatenated Mamba fully fuses multimodal features by constructing four channel scanning methods. Finally, an improved Multi-Scale Feature Fusion Module fuses multiscale features layer by layer, yielding highly discriminative classification evidence that improves the accuracy of remote sensing land-cover classification.
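To make the idea of fusing modalities through state space parameters more concrete, the following is a minimal NumPy sketch, not the authors' Cross-Mamba. It runs a simplified diagonal state-space recurrence (without the step-size discretization used in Mamba) and, as an assumption made purely for illustration, lets one modality's states be read out with an output projection computed from the other modality. The abstract does not state which parameters are exchanged, so the names `ssm_scan`, `cross_scan`, `W_B`, and `W_C` are hypothetical.

```python
import numpy as np


def ssm_scan(x, A, B, C):
    """Simplified diagonal state-space recurrence over a token sequence.

    x: (T, D) input tokens
    A: (D, N) per-channel diagonal state decay, typically in (0, 1)
    B: (T, N) input-dependent input projections
    C: (T, N) input-dependent output projections
    Returns y: (T, D)
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    y = np.zeros((T, D))
    for t in range(T):
        # State update: h_t = A * h_{t-1} + outer(x_t, B_t)
        h = A * h + x[t][:, None] * B[t][None, :]
        # Readout: y_t = h_t @ C_t
        y[t] = h @ C[t]
    return y


def cross_scan(x_a, x_b, A, W_B, W_C):
    """Hypothetical cross-modal scan (an assumption, not the paper's design).

    Modality a drives the state update; the readout projection C is
    computed from modality b, so b decides what is read from a's states.
    """
    B_a = x_a @ W_B  # (T, N), derived from modality a
    C_b = x_b @ W_C  # (T, N), derived from modality b
    return ssm_scan(x_a, A, B_a, C_b)
```

With `x_a` and `x_b` as flattened patch sequences of equal length `T` and width `D`, `cross_scan(x_a, x_b, A, W_B, W_C)` returns a `(T, D)` array. Swapping the two inputs would give the symmetric branch.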
Results and Discussions   Comparative experiments are conducted on three publicly available multimodal remote sensing land-cover classification datasets to evaluate the proposed model against seven mainstream models. The results show that the proposed model significantly outperforms its counterparts in terms of Overall Accuracy (OA), Average Accuracy (AA), and Kappa coefficient. Specifically, on the Muufl dataset, the OA of the proposed model is 3.49%, 3.80%, and 4.02% higher than those of the CNN-based, Transformer-based, and Mamba-based models, respectively (Table 2, Fig. 8). On the Houston2013 and Augsburg datasets, the OA of the proposed model exceeds that of the comparison algorithms by an average of 3.37% and 3.11%, respectively (Table 3, Table 4). These results indicate that combining a multi-modal, multi-scale architecture with the Mamba model effectively improves the accuracy of remote sensing land-cover classification. In addition, an ablation experiment validates the contribution of each proposed module to classification accuracy (Table 5): Spectral Mamba improves accuracy significantly, and the fusion modules each contribute to the overall performance to different degrees. The hyperparameter experiment identifies effective hyperparameter configurations for multiscale remote sensing image fusion (Table 6). Finally, compared with a Transformer model that uses an identical multi-scale architecture, the Mamba model not only achieves higher classification accuracy but also reduces the parameter count by 37.4% and shortens the training time by 10.7%, improving both accuracy and efficiency (Fig. 9).

Conclusions   The proposed M3RS employs the Mamba model to fuse multimodal and multiscale features, effectively enhancing the performance of remote sensing land-cover classification.
First, the different encoders used in M3RS effectively address the disparities among multimodal data, thereby providing richer complementary multimodal information for fusion and classification. Second, the proposed Cross-Mamba and Channel-Concatenated Mamba take the similarities and differences between Mamba and Transformer into account and respectively achieve efficient multimodal spatial feature interaction and comprehensive multimodal feature fusion, providing a hierarchical multimodal fusion approach. Moreover, the multiscale architecture overcomes, to a certain extent, the complex spatial distribution of remote sensing land covers. The proposed Multi-Scale Feature Fusion Module, composed of Spatial Mamba and channel attention, effectively integrates multiscale features and provides a reliable basis for the subsequent classification. Building on this work, future research will continue to optimize the model by exploring the underlying principles of Mamba and will investigate cross-attention mechanisms in depth to refine feature alignment in multimodal interaction and ensure the reliability of feature fusion.
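The three scores reported in the results, OA, AA, and the Kappa coefficient, can all be derived from a confusion matrix. The sketch below uses their standard definitions and assumes that rows index the true class, columns index the predicted class, and every class has at least one test sample. The function name `classification_metrics` is illustrative and is not taken from the paper.

```python
import numpy as np


def classification_metrics(conf):
    """Compute OA, AA, and Cohen's kappa from a confusion matrix.

    conf[i, j] counts samples of true class i predicted as class j.
    Assumes every class has at least one true sample (no empty rows).
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    # Overall Accuracy: fraction of all samples classified correctly.
    oa = np.trace(conf) / total
    # Average Accuracy: mean of the per-class recalls.
    per_class = np.diag(conf) / conf.sum(axis=1)
    aa = per_class.mean()
    # Expected chance agreement from the row and column marginals.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    # Kappa: agreement beyond chance, normalized by its maximum.
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa


# Two-class example: 50 of 50 correct for class 0, 40 of 50 for class 1.
oa, aa, kappa = classification_metrics([[50, 0], [10, 40]])
print(f"OA={oa:.2f}, AA={aa:.2f}, Kappa={kappa:.2f}")  # OA=0.90, AA=0.90, Kappa=0.80
```

OA weights classes by their sample counts, whereas AA weights each class equally, so the two diverge on imbalanced land-cover datasets. Kappa additionally discounts agreement expected by chance.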
  • [1]
    李树涛, 李聪妤, 康旭东. 多源遥感图像融合发展现状与未来展望[J]. 遥感学报, 2021, 25(1): 148–166. doi: 10.11834/jrs.20210259.

    LI Shutao, LI Congyu, and KANG Xudong. Development status and future prospects of multi-source remote sensing image fusion[J]. National Remote Sensing Bulletin, 2021, 25(1): 148–166. doi: 10.11834/jrs.20210259.
    [2]
    HANG Renlong, LI Zhu, GHAMISI P, et al. Classification of hyperspectral and LiDAR data using coupled CNNs[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(7): 4939–4950. doi: 10.1109/TGRS.2020.2969024.
    [3]
    REN Bo, HUA Chaoyue, HOU Biao, et al. PDCNet: A Polarimetric data-enhanced contrastive learning network for PolSAR land cover classification[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025, 18: 10010–10025. doi: 10.1109/JSTARS.2025.3557252.
    [4]
    REN Bo, WANG Zhao, GE Hanyuan, et al. Incremental land cover classification via soft label and subregion distillation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5647322. doi: 10.1109/TGRS.2025.3615670.
    [5]
    LI Shutao, SONG Weiwei, FANG Leyuan, et al. Deep learning for hyperspectral image classification: An overview[J]. IEEE Transactions on Geoscience and Remote Sensing, 2019, 57(9): 6690–6709. doi: 10.1109/TGRS.2019.2907932.
    [6]
    MA Xianping, ZHANG Xiaokang, and PUN M Q. RS3Mamba: Visual state space model for remote sensing image semantic segmentation[J]. IEEE Geoscience and Remote Sensing Letters, 2024, 21: 6011405. doi: 10.1109/LGRS.2024.3414293.
    [7]
    刘晓敏, 余梦君, 乔振壮, 等. 面向多源遥感数据分类的尺度自适应融合网络[J]. 电子与信息学报, 2024, 46(9): 3693–3702. doi: 10.11999/JEIT240178.

    LIU Xiaomin, YU Mengjun, QIAO Zhenzhuang, et al. Scale adaptive fusion network for multimodal remote sensing data classification[J]. Journal of Electronics & Information Technology, 2024, 46(9): 3693–3702. doi: 10.11999/JEIT240178.
    [8]
    廖帝灵, 赖涛, 黄海风, 等. LightMamba: 一种轻量级Mamba用于高光谱图形和激光雷达数据联合分类网络[J]. 电子与信息学报, 2025, 47(12): 4937–4947. doi: 10.11999/JEIT250981.

    LIAO Diling, LAI Tao, HUANG Haifeng, et al. LightMamba: A lightweight mamba network for the joint classification of HSI and LiDAR data[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4937–4947. doi: 10.11999/JEIT250981.
    [9]
    LAPARRA V, MALO J, and CAMPS-VALLS G. Dimensionality reduction via regression in hyperspectral imagery[J]. IEEE Journal of Selected Topics in Signal Processing, 2015, 9(6): 1026–1036. doi: 10.1109/JSTSP.2015.2417833.
    [10]
    MELGANI F and BRUZZONE L. Support vector machines for classification of hyperspectral remote-sensing images[C]. IEEE International Geoscience and Remote Sensing Symposium, Toronto, Canada, 2002: 506–508. doi: 10.1109/IGARSS.2002.1025088.
    [11]
    ZHOU Hao, LUO Fulin, ZHUANG Huiping, et al. Attention multihop graph and multiscale convolutional fusion network for hyperspectral image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5508614. doi: 10.1109/TGRS.2023.3265879.
    [12]
    ZHAO Linying and JI Shunping. CNN, RNN, or VIT? An evaluation of different deep learning architectures for spatio-temporal representation of sentinel time series[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 44–56. doi: 10.1109/JSTARS.2022.3219816.
    [13]
    LU Ting, DING Kexin, FU Wei, et al. Coupled adversarial learning for fusion classification of hyperspectral and LiDAR data[J]. Information Fusion, 2023, 93: 118–131. doi: 10.1016/j.inffus.2022.12.020.
    [14]
    XU Xiaodong, LI Wei, RAN Qiong, et al. Multisource remote sensing data classification based on convolutional neural network[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56(2): 937–949. doi: 10.1109/TGRS.2017.2756851.
    [15]
    WANG Jinzhe, ZHANG Junping, GUO Qingle, et al. WANG Jinzhe, ZHANG Junping, GUO Qingle, et al. Fusion of hyperspectral and LiDAR data based on dual-branch convolutional neural network[C]. Proceedings of the 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 2019: 3388–3391. doi: 10.1109/IGARSS.2019.8899332.
    [16]
    WU Xin, HONG Danfeng, and CHANUSSOT J. Convolutional neural networks for multimodal remote sensing data classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5517010. doi: 10.1109/TGRS.2021.3124913.
    [17]
    VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
    [18]
    DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16 × 16 words: Transformers for image recognition at scale[C]. Proceedings of the 9th International Conference on Learning Representations, 2021. (查阅网上资料, 未找到对应的出版地及页码信息, 请确认补充).
    [19]
    XUE Zhixiang, TAN Xiong, YU Xuchu, et al. Deep hierarchical vision transformer for hyperspectral and LiDAR data classification[J]. IEEE Transactions on Image Processing, 2022, 31: 3095–3110. doi: 10.1109/TIP.2022.3162964.
    [20]
    ROY S K, DERIA A, HONG Danfeng, et al. Multimodal fusion transformer for remote sensing image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5515620. doi: 10.1109/TGRS.2023.3286826.
    [21]
    YAO Jing, ZHANG Bing, LI Chenyu, et al. Extended Vision Transformer (ExViT) for land use and land cover classification: A multimodal deep learning framework[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5514415. doi: 10.1109/TGRS.2023.3284671.
    [22]
    ZHAO Guangrui, YE Qiaolin, SUN Le, et al. Joint classification of hyperspectral and LiDAR data using a hierarchical CNN and transformer[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5500716. doi: 10.1109/TGRS.2022.3232498.
    [23]
    ROY S K, SUKUL A, JAMALI A, et al. Cross hyperspectral and LiDAR attention transformer: An extended self-attention for land use and land cover classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5512815. doi: 10.1109/TGRS.2024.3374324.
    [24]
    SUN Le, WANG Xinyu, ZHENG Yuhui, et al. Multiscale 3-D–2-D mixed CNN and lightweight attention-free transformer for hyperspectral and LiDAR classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 2100116. doi: 10.1109/TGRS.2024.3367374.
    [25]
    SMITH J T H, WARRINGTON A, and LINDERMAN S W. Simplified state space layers for sequence modeling[C]. Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023: 1–13.
    [26]
    GU A and DAO T. Mamba: Linear-time sequence modeling with selective state spaces[EB/OL]. https://arxiv.org/abs/2312.00752, 2024.
    [27]
    ZHU Lianghui, LIAO Bencheng, ZHANG Qian, et al. Vision mamba: Efficient visual representation learning with bidirectional state space model[C]. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 2024.
    [28]
    LIU Yue, TIAN Yunjie, ZHAO Yuzhong, et al. VMamba: Visual state space model[C]. Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, 2024: 3273.
    [29]
    CHEN Keyan, CHEN Bowen, LIU Chenyang, et al. RSMamba: Remote sensing image classification with state space model[J]. IEEE Geoscience and Remote Sensing Letters, 2024, 21: 8002605. doi: 10.1109/LGRS.2024.3407111.
    [30]
    LIAO Diling, WANG Qingsong, LAI Tao, et al. Joint classification of hyperspectral and LiDAR data based on mamba[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5530915. doi: 10.1109/TGRS.2024.3459709.
    [31]
    GAO Feng, JIN Xuepeng, ZHOU Xiaowei, et al. MSFMamba: Multiscale feature fusion state space model for multisource remote sensing image classification[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5504116. doi: 10.1109/TGRS.2025.3535622.
    [32]
    刁文辉, 龚铄, 辛林霖, 等. 针对多模态遥感数据的自监督策略模型预训练方法[J]. 电子与信息学报, 2025, 47(6): 1658–1668. doi: 10.11999/JEIT241016.

    DIAO Wenhui, GONG Shuo, XIN Linlin, et al. A model pre-training method with self-supervised strategies for multimodal remote sensing data[J]. Journal of Electronics & Information Technology, 2025, 47(6): 1658–1668. doi: 10.11999/JEIT241016.