Semantic-guided Unified Multi-scale Deep Unrolling Network for Pansharpening

CHEN Junjie; WANG Tingting; FANG Faming; ZHANG Guixu

doi:10.11999/JEIT251252

CHEN Junjie, WANG Tingting, FANG Faming, ZHANG Guixu. Semantic-guided Unified Multi-scale Deep Unrolling Network for Pansharpening[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT251252

doi: 10.11999/JEIT251252 cstr: 32379.14.JEIT251252

School of Computer Science and Technology, East China Normal University, Shanghai 200062, China

Funds: The National Key Research and Development Program of China (2022ZD0161800), The National Natural Science Foundation of China, The Open Research Fund of KLATASDS-MOE, ECNU

Received Date: 2025-11-26
Accepted Date: 2026-04-17
Revised Date: 2026-04-17

Available Online: 2026-05-03

Abstract

Objective: With the rapid advancement of satellite imaging technologies, the demand for high-resolution multispectral remote sensing imagery has grown substantially across a wide range of applications. Because satellite platforms vary widely, datasets collected from different satellites exhibit a significant domain shift. As a result, most existing deep learning (DL)-based pansharpening methods are trained separately for each satellite dataset and generalize poorly across satellites. To address this limitation, this study proposes a Semantic-guided Unified Multi-scale Deep Unrolling Network (SUM-DUN), which is grounded in classical optimization theory and adopts a 3D multi-scale deep unfolding architecture for integrated feature extraction and fusion. Using multimodal large language models (MLLMs), the method derives semantic textual prompts from the input images; these prompts direct the model to adaptively adjust its feature representations and thereby enhance fusion quality. The method aims to achieve unified remote sensing image fusion through a tailored network architecture and prompt-guided mechanisms, providing reliable support for high-level image interpretation tasks.

Methods: Following the Maximum A Posteriori (MAP) estimation principle, the optimization process for recovering the high-resolution multispectral (HRMS) image is unfolded into the proposed SUM-DUN (Fig. 1). Each iteration stage of SUM-DUN consists of two main modules: a Gradient Descent Module (GDM) and a Semantic-guided Proximal Mapping Network (SPMN), which approximate the operations in Eq. (5) and Eq. (6), respectively. The GDM performs a gradient descent update based on the current feature estimate and the degradation model. The SPMN, implemented with a Transformer-based architecture as illustrated in Fig. 2(b), incorporates semantic textual prompts generated from the input image pair by MLLMs. These prompts guide the network to adaptively select appropriate feature propagation strategies for the current pair, helping to suppress noise and mitigate discrepancies across different satellite sensors. Moreover, using upsampling and downsampling operations, the network transmits multispectral (MS) and panchromatic (PAN) features between iterative stages, progressively preserving and enhancing multi-scale spatial and spectral information throughout the unfolding process.

Results and Discussions: To demonstrate the effectiveness of the proposed method, we compare it against seven representative baselines: two traditional methods (BDSD and PRACS) and five DL-based methods (AWFLN, FusionMamba, PanMamba, WFANet and TMDiff). For the reduced-resolution evaluation, where ground-truth HRMS images are available, we adopt several widely used reference-based metrics: Spectral Angle Mapper (SAM), Spatial Correlation Coefficient (SCC), Peak Signal-to-Noise Ratio (PSNR), Erreur Relative Global Adimensionnelle de Synthèse (ERGAS), Averaged Universal Image Quality Index (QAVE) and the Universal Image Quality Index for 4-band and 8-band images. These metrics jointly evaluate spectral fidelity, spatial consistency, and overall image quality. For the full-resolution evaluation, where ground-truth HRMS images are unavailable, we rely on no-reference quality indices. Specifically, we employ the Hybrid Quality with No Reference (HQNR) metric, along with its spectral distortion and spatial distortion components, to assess fusion quality in real-world scenarios. Quantitative evaluations on the GF-1, QB, WV-2, and WV-4 test datasets show that the proposed method consistently achieves either the best or second-best performance across all metrics, under both reduced-resolution and full-resolution settings (Tables 2–3). These results indicate that the proposed method preserves spectral fidelity and spatial consistency simultaneously, while maintaining robust performance across different satellites and remaining effective in more challenging scenarios. The ablation studies validate the effectiveness of the 3D architecture, the multi-scale network design, and the spatial–channel prompt guidance mechanism: removing or altering any of these components leads to varying degrees of performance degradation (Tables 4–5).

Conclusions: This study proposes a semantic-guided unified multi-scale deep unfolding method for pansharpening, which uses semantic prompts generated by an MLLM to facilitate efficient and unified fusion of images from different satellites. The approach is built upon a deep unfolding framework and employs a 3D convolutional architecture to accommodate the varying numbers of spectral bands across satellite datasets. A multi-scale network design is further incorporated to extract spatial and spectral features at different levels, thereby enhancing the fusion capability. In addition, a semantic prompt integration module adaptively routes spatial and channel features based on the extracted semantic information, enabling more effective feature propagation and improving both spatial detail reconstruction and spectral consistency. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance in terms of both visual quality and quantitative evaluation metrics.
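
The stage structure described in the Methods (a gradient descent update followed by a proximal mapping) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: Eq. (5) and Eq. (6) are not reproduced on this page, so the degradation operator is assumed to be plain average pooling, only the MS data term is used (the PAN term is omitted), and a soft-thresholding function stands in for the learned, prompt-guided SPMN. The function names are hypothetical.

```python
import numpy as np

def downsample(x, r):
    """Assumed degradation operator A: average-pool each band by a factor r."""
    b, h, w = x.shape
    return x.reshape(b, h // r, r, w // r, r).mean(axis=(2, 4))

def downsample_adjoint(y, r):
    """Adjoint A^T of average pooling: replicate each pixel, divide by r*r."""
    return np.repeat(np.repeat(y, r, axis=1), r, axis=2) / (r * r)

def gradient_step(x, lrms, r, eta):
    """One gradient descent step on 0.5 * ||A x - y||^2: x - eta * A^T (A x - y)."""
    residual = downsample(x, r) - lrms
    return x - eta * downsample_adjoint(residual, r)

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1.

    Placeholder only: in the paper this role is played by a learned,
    Transformer-based proximal network guided by text prompts.
    """
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def unfolded_stages(lrms, r, num_stages=4, eta=1.0, tau=0.0):
    """Run a fixed number of (gradient step, proximal step) stages."""
    # Nearest-neighbour upsampling of the low-resolution MS image as the start.
    x = np.repeat(np.repeat(lrms, r, axis=1), r, axis=2)
    for _ in range(num_stages):
        x = gradient_step(x, lrms, r, eta)
        x = soft_threshold(x, tau)
    return x
```

For this particular operator, A Aᵀ equals the identity scaled by 1/r², so choosing `eta = r * r` makes a single gradient step satisfy the data constraint exactly. In an unfolding network, the step size and the proximal mapping are learned per stage instead of being fixed by hand.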
Keywords:
- Remote sensing image fusion
- Unified pansharpening
- Multimodal large language model
- Deep unfolding network
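
Three of the reference-based metrics named in the abstract (SAM, PSNR and ERGAS) have commonly used closed forms, sketched below in NumPy. This is an illustrative sketch and not the authors' evaluation code: the `(bands, H, W)` array layout, the default `max_val=1.0`, and the default resolution ratio of 4 are assumptions, and published toolboxes may differ in details such as averaging order or masking.

```python
import numpy as np

def sam_degrees(ref, fused, eps=1e-12):
    """Spectral Angle Mapper: mean angle (degrees) between per-pixel spectra."""
    dot = np.sum(ref * fused, axis=0)
    norms = np.linalg.norm(ref, axis=0) * np.linalg.norm(fused, axis=0)
    cos = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def psnr(ref, fused, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB over all bands and pixels."""
    mse = np.mean((ref - fused) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ergas(ref, fused, ratio=4):
    """ERGAS: 100 / ratio * sqrt(mean over bands of RMSE_b^2 / mean_b^2)."""
    rmse_sq = np.mean((ref - fused) ** 2, axis=(1, 2))
    mean_sq = np.mean(ref, axis=(1, 2)) ** 2
    return float(100.0 / ratio * np.sqrt(np.mean(rmse_sq / mean_sq)))
```

Lower is better for SAM and ERGAS, and higher is better for PSNR. SAM is insensitive to a uniform scaling of the spectra, which is why it is usually reported alongside an intensity-sensitive metric such as ERGAS.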

References

[1]	THOMAS C, RANCHIN T, WALD L, et al. Synthesis of multispectral images to high spatial resolution: A critical review of fusion methods based on remote sensing physics[J]. IEEE Transactions on Geoscience and Remote Sensing, 2008, 46(5): 1301–1312. doi: 10.1109/TGRS.2007.912448.
[2]	JIN Jing and WANG Feng. A distributed multi-satellite collaborative framework for remote sensing scene classification[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4677–4688. doi: 10.11999/JEIT250866.
[3]	WEN Hongli, HU Qinghao, HUANG Liwei, et al. Few-shot remote sensing image classification based on parameter-efficient vision transformer and multimodal guidance[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4689–4703. doi: 10.11999/JEIT250996.
[4]	HAN Wenqi, JIANG Wen, GENG Jie, et al. PATC: Prototype alignment and topology-consistent pseudo-supervision for multimodal semi-supervised semantic segmentation of remote sensing images[J]. Journal of Electronics & Information Technology, 2025, 47(12): 4714–4727. doi: 10.11999/JEIT251115.
[5]	ZENG Delu, HU Yuwen, HUANG Yue, et al. Pan-sharpening with structural consistency and ℓ_1/2 gradient prior[J]. Remote Sensing Letters, 2016, 7(12): 1170–1179. doi: 10.1080/2150704X.2016.1222098.
[6]	WU Zhongcheng, HUANG Tingzhu, DENG Liangjian, et al. VO+Net: An adaptive approach using variational optimization and deep learning for panchromatic sharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5401016. doi: 10.1109/TGRS.2021.3066425.
[7]	LU Hangyuan, YANG Yong, HUANG Shuying, et al. AWFLN: An adaptive weighted feature learning network for pansharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5400815. doi: 10.1109/TGRS.2023.3241643.
[8]	XIE Xinyu, CUI Yawen, TAN Tao, et al. FusionMamba: Dynamic feature enhancement for multimodal image fusion with mamba[J]. Visual Intelligence, 2024, 2(1): 37. doi: 10.1007/s44267-024-00072-9.
[9]	HE Xuanhua, CAO Ke, ZHANG Jie, et al. Pan-mamba: Effective pan-sharpening with state space model[J]. Information Fusion, 2025, 115: 102779. doi: 10.1016/j.inffus.2024.102779.
[10]	HUANG Jie, HUANG Rui, XU Jinghao, et al. Wavelet-assisted multi-frequency attention network for pansharpening[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, 2025: 3662–3670. doi: 10.1609/aaai.v39i4.32381.
[11]	JIA Menglin, TANG Luming, CHEN B C, et al. Visual prompt tuning[C]. 17th European Conference on Computer Vision, Tel-Aviv, Israel, 2022: 709–727. doi: 10.1007/978-3-031-19827-4_41.
[12]	NIE Xing, NI Bolin, CHANG Jianlong, et al. Pro-tuning: Unified prompt tuning for vision tasks[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(6): 4653–4667. doi: 10.1109/TCSVT.2023.3327605.
[13]	CUI Yuning, ZAMIR S W, KHAN S, et al. AdaIR: Adaptive all-in-one image restoration via frequency mining and modulation[C]. The Thirteenth International Conference on Learning Representations, Singapore, Singapore, 2025: 57335–57356.
[14]	ZENG Haijin, WANG Xiangming, CHEN Yongyong, et al. Vision-language gradient descent-driven all-in-one deep unfolding networks[C]. Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, United States, 2025: 7524–7533. doi: 10.1109/CVPR52734.2025.00705.
[15]	YANG Zhiwen, CHEN Haowei, QIAN Ziniu, et al. All-in-one medical image restoration via task-adaptive routing[C]. 27th International Conference on Medical Image Computing and Computer Assisted Intervention, Marrakesh, Morocco, 2024: 67–77. doi: 10.1007/978-3-031-72104-5_7.
[16]	XING Yinghui, QU Litao, ZHANG Shizhou, et al. Empower generalizability for pansharpening through text-modulated diffusion model[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5633812. doi: 10.1109/TGRS.2024.3434431.
[17]	LI Xueheng, HE Xuanhua, CAO Ke, et al. Exploring text-guided information fusion through chain-of-reasoning for pansharpening[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5407314. doi: 10.1109/TGRS.2025.3604447.
[18]	FANG Shijie and GAN Hongping. SSUN-net: Spatial-spectral prior-aware unfolding network for pan-sharpening[C]. Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, United States, 2025: 2897–2905. doi: 10.1609/aaai.v39i3.32296.
[19]	YAN Qiuhai, JIANG Aiwen, CHEN Kang, et al. Textual prompt guided image restoration[J]. Engineering Applications of Artificial Intelligence, 2025, 155: 110981. doi: 10.1016/j.engappai.2025.110981.
[20]	CONDE M V, GEIGLE G, and TIMOFTE R. InstructIR: High-quality image restoration following human instructions[C]. 18th European Conference on Computer Vision, Milan, Italy, 2024: 1–21. doi: 10.1007/978-3-031-72764-1_1.
[21]	ZENG Aohan, XU Bin, WANG Bowen, et al. ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools[EB/OL]. https://arxiv.org/abs/2406.12793, 2024.
[22]	XIAO Shitao, LIU Zheng, ZHANG Peitian, et al. C-pack: Packed resources for general Chinese embeddings[C]. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, USA, 2024: 641–649. doi: 10.1145/3626772.3657878.
[23]	MENG Xiangchao, XIONG Yiming, SHAO Feng, et al. A large-scale benchmark data set for evaluating pansharpening performance: Overview and implementation[J]. IEEE Geoscience and Remote Sensing Magazine, 2021, 9(1): 18–52. doi: 10.1109/MGRS.2020.2976696.
[24]	WALD L, RANCHIN T, and MANGOLINI M. Fusion of satellite images of different spatial resolutions: Assessing the quality of resulting images[J]. Photogrammetric Engineering and Remote Sensing, 1997, 63(6): 691–699.
[25]	GARZELLI A, NENCINI F, and CAPOBIANCO L. Optimal MMSE pan sharpening of very high resolution multispectral images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2008, 46(1): 228–236. doi: 10.1109/TGRS.2007.907604.
[26]	CHOI J, YU Kiyun, and KIM Y. A new adaptive component-substitution-based satellite image fusion by using partial replacement[J]. IEEE Transactions on Geoscience and Remote Sensing, 2011, 49(1): 295–309. doi: 10.1109/TGRS.2010.2051674.
[27]	VIVONE G, ALPARONE L, CHANUSSOT J, et al. A critical comparison among pansharpening algorithms[J]. IEEE Transactions on Geoscience and Remote Sensing, 2015, 53(5): 2565–2586. doi: 10.1109/TGRS.2014.2361734.
[28]	ZHOU Jie, CIVCO D L, and SILANDER J A. A wavelet transform method to merge Landsat TM and SPOT panchromatic data[J]. International Journal of Remote Sensing, 1998, 19(4): 743–757. doi: 10.1080/014311698215973.
[29]	WALD L. Data Fusion. Definitions and Architectures - Fusion of Images of Different Spatial Resolutions[M]. Paris, France: Presses de l’École, Ecole des Mines de Paris, 2002: 165–189.
[30]	WANG Zhou and BOVIK A C. A universal image quality index[J]. IEEE Signal Processing Letters, 2002, 9(3): 81–84. doi: 10.1109/97.995823.
[31]	GARZELLI A and NENCINI F. Hypercomplex quality assessment of multi/hyperspectral images[J]. IEEE Geoscience and Remote Sensing Letters, 2009, 6(4): 662–665. doi: 10.1109/LGRS.2009.2022650.
[32]	ARIENZO A, VIVONE G, GARZELLI A, et al. Full-resolution quality assessment of pansharpening: Theoretical and hands-on approaches[J]. IEEE Geoscience and Remote Sensing Magazine, 2022, 10(3): 168–201. doi: 10.1109/MGRS.2022.3170092.