Recent Advances in Remote Sensing Image-Text Retrieval Driven by Vision–Language Foundation Models
-
Abstract: Remote sensing image-text retrieval (RS-TIR) establishes cross-modal semantic associations between remote sensing imagery and natural-language descriptions, providing essential support for the semantic understanding and intelligent retrieval of massive geospatial data. With the continued accumulation of high-resolution Earth observation data, factors such as complex scenes, multi-scale structures, specialized semantic expression, and scarce annotation clearly constrain traditional handcrafted-feature methods and conventional deep cross-modal models in semantic modeling, cross-scene generalization, and adaptation to open environments. Focusing on RS-TIR driven by Vision-Language foundation Models (VLMs), this review systematically surveys task formulation, domain challenges, evaluation benchmarks, and the technical evolution of the field; it highlights model architecture paradigms, remote sensing domain adaptation strategies, and cross-modal semantic alignment mechanisms, and summarizes current progress through representative datasets, typical methods, and performance comparisons. The analysis shows that vision-language foundation models offer clear advantages in narrowing the semantic gap, improving zero-shot transfer, and strengthening complex semantic understanding, while unified modeling of multi-source heterogeneous data, geographic knowledge enhancement, continual learning in open scenarios, and lightweight deployment remain key open problems. The review provides a systematic reference for remote sensing multimodal information understanding, cross-modal retrieval model design, and engineering applications.
Significance: Remote sensing image-text retrieval (RS-TIR) connects massive Earth observation imagery with natural-language queries and has become an important interface for geospatial intelligence systems. Compared with conventional content-based retrieval, RS-TIR enables users to search scenes, objects, spatial layouts, and functional regions through semantic descriptions instead of handcrafted visual cues. This capability is increasingly needed in natural resource monitoring, urban governance, disaster response, environmental assessment, and on-demand retrieval from rapidly growing satellite archives. However, the task remains fundamentally challenging: remote sensing imagery is captured from a nadir or near-nadir perspective, has no canonical orientation, contains extreme scale variation from tiny vehicles to large airports, and often involves domain-specific semantic descriptions such as land-use attributes, spatial distributions, and geoscientific relations. Meanwhile, the amount of high-quality image-text annotation remains limited relative to the scale of remote sensing data. These properties enlarge the semantic gap between images and language and constrain the generalization ability of traditional cross-modal retrieval methods. Against this background, the review focuses on how Vision-Language foundation Models (VLMs) reshape RS-TIR by introducing large-scale contrastive pre-training, stronger transferable representations, and more flexible multimodal interaction mechanisms. The review also clarifies why remote sensing adaptation is necessary and why a dedicated synthesis of architectures, datasets, alignment mechanisms, and future directions is timely for the field.
Progress: The technical development of RS-TIR is organized from three complementary perspectives. First, the review summarizes the domain-specific challenges that shape the task, including visually isotropic topology with extreme scale variation, professionalized and fine-grained textual semantics, and the compounded semantic gap between overhead imagery and natural-language descriptions (Fig. 3). The overall survey structure is then outlined to show the logical progression from task formulation to future challenges (Fig. 1). Along the methodological timeline, RS-TIR evolves from handcrafted visual descriptors and shallow semantic mapping to deep representation learning, and then to VLM-driven paradigms with broader generalization and zero-shot transfer ability (Fig. 4, Table 2). Early methods rely on color, texture, shape, and hash-based retrieval, but they struggle to model high-level geospatial semantics and complex scene composition. Deep learning methods improve retrieval by learning joint embedding spaces, adopting dual-encoder or interaction-based architectures, and introducing multi-scale feature fusion and region-aware matching. These methods substantially enhance semantic consistency, yet they still depend heavily on labeled data and often suffer from limited robustness in open or cross-sensor scenarios. Second, the review summarizes the benchmark ecosystem used to evaluate these methods. Representative datasets span small-scale test sets such as Sydney-Caption and UCM-Caption, mainstream benchmarks such as RSICD and RSITMD, and recent large-scale training resources such as RS5M and SkyScript (Table 1).
These datasets reveal a clear transition from small manually annotated corpora to web-scale or automatically generated image-text pairs, which in turn supports domain pre-training and larger model adaptation. Third, the review analyzes the core VLM techniques now driving progress in RS-TIR. The model spectrum and representative architecture families, including contrastive dual-encoder models, multimodal interaction models, and remote sensing foundation models integrated with large language models, are summarized systematically (Fig. 5, Fig. 6, Table 3). Domain adaptation routes are further grouped into continued remote sensing pre-training, parameter-efficient transfer learning, adapter-based tuning, prompt learning, and instruction tuning. At the semantic alignment level, the review emphasizes contrastive joint embedding, fine-grained multi-scale alignment, and the incorporation of remote sensing priors such as spatial topology and geolocation. Performance comparisons on RSICD and RSITMD show that the introduction of remote sensing VLMs, especially RemoteCLIP, GeoRSCLIP, iEBAKER, and LRSCLIP, leads to consistent gains in mean Recall and overall retrieval robustness (Table 4). In parallel, the review also tracks the extension of retrieval capability into unified multi-task remote sensing models, where retrieval, grounding, segmentation, and reasoning begin to share a common multimodal representation space.
Conclusions: Several conclusions are drawn from the comparative analysis. First, VLMs establish a new dominant paradigm for RS-TIR because they significantly narrow the cross-modal semantic gap while improving transferability across datasets and scenes. Second, there is no universally optimal architecture: dual-encoder models remain attractive for large-scale retrieval because of their efficiency, whereas interaction-based or instruction-enhanced models offer finer semantic alignment at higher computational cost. Third, domain adaptation is indispensable. Continued pre-training on remote sensing image-text corpora, parameter-efficient tuning, and prompt-based adaptation consistently outperform direct reuse of Internet-trained VLMs, indicating that remote sensing imagery differs too strongly from natural-image distributions to rely on generic pre-training alone. Fourth, the most effective recent methods do not improve performance through scale alone; they also exploit remote-sensing-specific information, including multi-scale structures, foreground entities, explicit keyword reasoning, and spatial priors. Finally, the review shows that the field is shifting from isolated retrieval models toward more general geospatial multimodal systems. Retrieval is no longer treated only as a matching task, but also as a key capability supporting question answering, instruction following, knowledge augmentation, and coordinated reasoning in remote sensing applications.
Prospects: Future research is expected to move in four closely related directions. One direction is the unified representation of multi-source heterogeneous data, especially the integration of optical imagery with synthetic aperture radar, hyperspectral data, thermal infrared observations, and multi-temporal acquisitions. Another direction is knowledge-enhanced retrieval, where geospatial priors, land-use rules, remote sensing terminology, and external knowledge bases are incorporated into multimodal alignment and retrieval-augmented reasoning. A third direction is lifelong and open-world learning.
Real deployment requires models to remain reliable under seasonal changes, sensor updates, regional domain shifts, cloud contamination, and newly emerging categories without catastrophic forgetting. The fourth direction concerns efficiency and deployability. Because practical remote sensing systems often operate under tight computational budgets, lightweight tuning, sparse computation, token reduction, model compression, and on-orbit or edge inference will become increasingly important. Interactive and explainable retrieval is also likely to grow in importance, allowing analysts to refine queries through dialogue and inspect the image regions or semantic cues that support retrieval decisions. Overall, continued progress in data construction, domain adaptation, semantic alignment, and efficient multimodal modeling is expected to make RS-TIR a more robust infrastructure capability for Earth observation applications. -
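The contrastive joint-embedding mechanism emphasized throughout this review (CLIP-style dual encoders trained with a symmetric InfoNCE objective) can be summarized with a short sketch. The snippet below is a minimal illustration in PyTorch, not the implementation of any specific surveyed model such as RemoteCLIP or GeoRSCLIP; the linear projection heads, feature dimensions, and random inputs are placeholder assumptions standing in for real image and text encoders.

```python
# Minimal sketch of CLIP-style dual-encoder contrastive alignment for
# image-text retrieval (illustrative only; dimensions and projection heads
# are assumptions, not taken from any surveyed model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        # Projection heads mapping backbone features into a shared space.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Learnable temperature, stored as a log scale as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, img_feats, txt_feats):
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return z_img, z_txt

def symmetric_infonce(z_img, z_txt, logit_scale):
    # Matched image-text pairs sit on the diagonal of the similarity matrix.
    logits = logit_scale.exp() * z_img @ z_txt.t()
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: random features stand in for pooled encoder outputs.
model = DualEncoder()
z_img, z_txt = model(torch.randn(8, 2048), torch.randn(8, 768))
loss = symmetric_infonce(z_img, z_txt, model.logit_scale)
# At retrieval time, ranking reduces to cosine similarity in the shared space.
scores = z_img @ z_txt.t()  # rows: images, columns: candidate sentences
```

Dual-encoder retrieval owes its efficiency to this factorization: gallery embeddings can be pre-computed once, and each query reduces to a similarity lookup in the shared space, which is why dual encoders remain attractive for large-scale retrieval despite coarser alignment than interaction-based models.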
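Among the adaptation routes surveyed above (continued pre-training, parameter-efficient transfer learning, adapter-based tuning, prompt learning, and instruction tuning), adapter-based tuning is easy to illustrate concretely. The sketch below shows one generic form, a residual bottleneck adapter attached to a frozen encoder; the module names, dimensions, and stand-in backbone are assumptions for illustration and do not reproduce any particular published adapter design.

```python
# Minimal sketch of adapter-based parameter-efficient adaptation: the
# pre-trained backbone is frozen and only a small residual bottleneck
# adapter is trained on remote sensing image-text pairs (illustrative only).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim=512, reduction=4):
        super().__init__()
        hidden = dim // reduction
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start close to an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection

class AdaptedEncoder(nn.Module):
    def __init__(self, backbone, dim=512):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze pre-trained weights
            p.requires_grad = False
        self.adapter = BottleneckAdapter(dim)  # the only trainable part

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return self.adapter(feats)

# Toy usage with a stand-in backbone; in practice this would be a CLIP-style
# image or text encoder producing pooled features of the chosen dimension.
backbone = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 512))
encoder = AdaptedEncoder(backbone)
trainable = sum(p.numel() for p in encoder.parameters() if p.requires_grad)
total = sum(p.numel() for p in encoder.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

Only a small fraction of the parameters is updated, which is the property that makes such schemes attractive when remote sensing image-text pairs are scarce or when many task-specific variants of one foundation model must be maintained.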
Table 1 Representative remote sensing image-text retrieval datasets
Dataset | Images | Image size | Captioning / annotation scheme | Role
Sydney-Caption[29] | 613 | 500 × 500 | 5 captions per image | Core test benchmark
UCM-Caption[30] | 2100 | 256 × 256 | 5 captions per image | Core test benchmark
RSICD[31] | 10,921 | 224 × 224 | 1–5 captions per image | Core test benchmark
RSITMD[32] | 4743 | 256 × 256 | 5 captions per image + fine-grained keywords | Core test benchmark
NWPU-Caption[33,34] | 31,500 | 256 × 256 | 5 captions per image | Large-scale pre-training / training data
RSICap[35] | 2585 | 512 × 512 | 1 high-quality manually annotated caption per image | Generative / instruction-tuning data
RS5M[36] | 5M | Full resolution | Keyword filtering + BLIP-2 generation | Large-scale pre-training / training data
SkyScript[37] | 5.2M | Full resolution | Automatic generation + CLIP filtering | Large-scale pre-training / training data
MMRS-1M subset[38] | 1M | Full resolution | Multi-task instruction following | Generative / instruction-tuning data
GeoLangBind-2M subset[39] | 2M | Full resolution | Dataset integration + automatic generation | Large-scale pre-training / training data
Git-10M[40] | 10M | Full resolution | Automatic generation + data cleaning | Large-scale pre-training / training data

Table 2 Technical evolution of remote sensing image-text retrieval
Stage | Core idea | Technical characteristics | Strengths | Limitations
Traditional methods | Handcrafted features + keyword matching | Low-level visual description | Simple structure, low computational cost, some interpretability | Difficulty expressing high-level semantics; pronounced semantic gap
Deep learning methods | Representation learning | Learn a shared image-text embedding space | Stronger semantic expressiveness; end-to-end trainable | Dependence on large-scale annotated data; limited cross-scene generalization
Vision-language foundation models | Large-scale pre-training | VLMs build a unified cross-modal semantic space | Strong generalization; zero-shot transfer | Domain distribution gap; remote sensing adaptation required
VLM extension and multi-task unified models | Unified semantic representation | Retrieval, detection, segmentation, and other tasks within a single framework | Task synergy, shared representations, richer scene understanding | High computational complexity; large model scale and training cost

Table 3 Comparison of representative vision-language models and remote sensing foundation models
Model | Backbone | Training data | Training strategy | Domain
CLIP[9] | ResNets, ViT | WIT (WebImageText) | Contrastive learning | General
Flamingo[58] | NFNet, Transformer | COCO, OKVQA, VQAv2, MSVDQA | Gated cross-attention + Perceiver Resampler | General
ALBEF[79] | BERT, ViT | COCO and Visual Genome | Contrastive learning | General
BLIP[80] | BERT, ViT | COCO, Visual Genome, Conceptual Captions | Contrastive learning | General
LLaVA[81] | Vicuna, CLIP | ScienceQA, CC-595K, LLaVA-Instruct-158K | Projector + instruction tuning | General
Qwen2.5-VL[83] | ViT, Qwen2.5 | Multiple datasets | Dynamic-resolution pre-training + multi-stage instruction tuning | General
GPT-4V[84] | Undisclosed | Multiple datasets | Undisclosed | General
Gemini[85] | Undisclosed | Multiple datasets | Undisclosed | General
RemoteCLIP[10] | ViT-L/14 | 10 datasets | Contrastive continual pre-training | Remote sensing
RSGPT[35] | InstructBLIP | Images + text captions + instructions | Text supervision | Remote sensing
GeoChat[88] | LLaVA-1.5 | RS dataset with 318k instruction pairs | LoRA fine-tuning | Remote sensing
EarthPT[89] | Unsupervised multitask learners | ClearSky | Autoregressive | Remote sensing
DINO-MM[90] | Self-supervised multitask learners | Multiple datasets | Distillation + contrastive | Remote sensing
SkySense[91] | ViT | Multiple datasets | Freeze + fine-tune | Remote sensing
RingMo[92] | ViT | Multiple datasets | MAE + PIMask | Remote sensing
RSPrompter[93] | SAM | WHU, NWPU, SSDD | Freeze + fine-tune | Remote sensing
SpectralGPT[94] | ViT | fMoW / BigEarthNet-S2 | MAE + 3D mask | Remote sensing

Table 4 Retrieval performance comparison of remote sensing VLMs
Method | Year | Parameters | Pre-training data scale | Fine-tuning approach | RSICD (mR) | RSITMD (mR)
PIR[62] | 2023 | 161M | 5 million | Visual instruction fine-tuning | 23.48 | 39.09
RemoteCLIP[10] | 2024 | 428M | 0.83 million | Continued pre-training | 35.02 | 50.68
GeoRSCLIP[36] | 2024 | 151M | 5 million | Parameter-efficient fine-tuning | 38.26 | 52.43
SkyCLIP[37] | 2024 | 428M | 2.6 million | Continued pre-training / zero-shot transfer | 19.97 | 30.58
iEBAKER[110] | 2025 | 151M | 0.2 million | Direct fine-tuning | 43.41 | 55.65
LRSCLIP[11] | 2025 | 367M | 2 million | Full fine-tuning | 48.34 | 65.04
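The RSICD and RSITMD columns in Table 4 report mean Recall (mR), the average of Recall@1/5/10 over both retrieval directions. For reference, a minimal sketch of this computation is given below; it assumes a single ground-truth match per query, whereas the public benchmarks associate five captions with each image and count a hit if any of them ranks within the top k.

```python
# Minimal sketch of the mean Recall (mR) metric reported in Table 4
# (simplified to one ground-truth match per query; illustrative only).
import numpy as np

def recall_at_k(sim, gt_index, k):
    # sim: (num_queries, num_gallery) similarity matrix.
    # gt_index[i]: index of the ground-truth gallery item for query i.
    ranks = np.argsort(-sim, axis=1)            # best match first
    hits = [gt_index[i] in ranks[i, :k] for i in range(sim.shape[0])]
    return 100.0 * float(np.mean(hits))

def mean_recall(sim_i2t, gt_i2t, sim_t2i, gt_t2i):
    # Average R@1/5/10 over image-to-text and text-to-image retrieval.
    scores = [recall_at_k(sim_i2t, gt_i2t, k) for k in (1, 5, 10)]
    scores += [recall_at_k(sim_t2i, gt_t2i, k) for k in (1, 5, 10)]
    return float(np.mean(scores))

# Toy example: 20 images and 20 captions, one matching caption per image.
rng = np.random.default_rng(0)
sim = rng.standard_normal((20, 20))
gt = np.arange(20)
print(mean_recall(sim, gt, sim.T, gt))
```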
-
[1] ZHANG Lefei and ZHANG Liangpei. Artificial intelligence for remote sensing data analysis: A review of challenges and opportunities[J]. IEEE Geoscience and Remote Sensing Magazine, 2022, 10(2): 270–294. doi: 10.1109/MGRS.2022.3145854.
[2] SUN Ziheng, SANDOVAL L, CRYSTAL-ORNELAS R, et al. A review of earth artificial intelligence[J]. Computers & Geosciences, 2022, 159: 105034. doi: 10.1016/j.cageo.2022.105034.
[3] CHAUDHURI U, BANERJEE B, BHATTACHARYA A, et al. CMIR-NET: A deep learning based model for cross-modal retrieval in remote sensing[J]. Pattern Recognition Letters, 2020, 131: 456–462. doi: 10.1016/j.patrec.2020.02.006.
[4] XU Lingxin, WANG Luyao, ZHANG Jinzhi, et al. A review of cross-modal image-text retrieval in remote sensing[J]. Remote Sensing, 2025, 17(24): 3995. doi: 10.3390/rs17243995.
[5] MANDAL D, CHAUDHURY K N, and BISWAS S. Generalized semantic preserving hashing for n-label cross-modal retrieval[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, 2017: 2633–2641. doi: 10.1109/CVPR.2017.282.
[6] YUAN Zhiqiang, ZHANG Wenkai, TIAN Changyuan, et al. Remote sensing cross-modal text-image retrieval based on global and local information[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5620616. doi: 10.1109/TGRS.2022.3163706.
[7] ZHANG Jingqi, JI Jiaqi, RU Huiying, et al. A global-local dual-stream collaborative enhancement model for cross-modal image and text retrieval in remote sensing scenarios[J]. AIP Advances, 2025, 15(8): 085009. doi: 10.1063/5.0280560.
[8] PAN Jiancheng, MA Qing, and BAI Cong. A prior instruction representation framework for remote sensing image-text retrieval[C]. Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, Canada, 2023: 611–620. doi: 10.1145/3581783.3612374.
[9] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]. Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 2021: 8748–8763.
[10] LIU Fan, CHEN Delong, GUAN Zhangqingyun, et al. RemoteCLIP: A vision language foundation model for remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5622216. doi: 10.1109/TGRS.2024.3390838.
[11] CHEN Weizhi, DENG Yupeng, JIN Wei, et al. DGTRSD and DGTRSCLIP: A dual-granularity remote sensing image–text dataset and vision–language foundation model for alignment[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025, 18: 29113–29130. doi: 10.1109/JSTARS.2025.3625958.
[12] YANG Xiaoyu, LI Chao, WANG Zhiming, et al. Remote sensing cross-modal text-image retrieval based on attention correction and filtering[J]. Remote Sensing, 2025, 17(3): 503. doi: 10.3390/rs17030503.
[13] GUAN Jihong, SHU Yulou, LI Wengen, et al. PR-CLIP: Cross-modal positional reconstruction for remote sensing image-text retrieval[J]. Remote Sensing, 2025, 17(13): 2117. doi: 10.3390/rs17132117.
[14] WANG Yijing, TANG Xu, MA Jingjing, et al. Cross-modal remote sensing image-text retrieval via context and uncertainty-aware prompt[J]. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(6): 11384–11398. doi: 10.1109/TNNLS.2024.3458898.
[15] HOXHA G, ANGYAL O, and DEMIR B. Self-supervised cross-modal text-image time series retrieval in remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5639115. doi: 10.1109/TGRS.2025.3601393.
[16] WANG Tianshi, LI Fengling, ZHU Lei, et al. Cross-modal retrieval: A systematic review of methods and future directions[J]. Proceedings of the IEEE, 2024, 112(11): 1716–1754. doi: 10.1109/JPROC.2024.3525147.
[17] BALTRUŠAITIS T, AHUJA C, and MORENCY L P. Multimodal machine learning: A survey and taxonomy[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423–443. doi: 10.1109/TPAMI.2018.2798607.
[18] NGIAM J, KHOSLA A, KIM M, et al. Multimodal deep learning[C]. Proceedings of the 28th International Conference on Machine Learning, Bellevue, USA, 2011: 689–696.
[19] KIROS R, SALAKHUTDINOV R, and ZEMEL R S. Unifying visual-semantic embeddings with multimodal neural language models[J]. arXiv preprint arXiv: 1411.2539, 2014. doi: 10.48550/arXiv.1411.2539.
[20] LIU Yang, FU Zhengye, and ZHENG Fengbin. Review on high resolution remote sensing image classification and recognition[J]. Journal of Geo-Information Science, 2015, 17(9): 1080–1091. doi: 10.3724/SP.J.1047.2015.01080.
[21] XIA Guisong, BAI Xiang, DING Jian, et al. DOTA: A large-scale dataset for object detection in aerial images[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, 2018: 3974–3983. doi: 10.1109/CVPR.2018.00418.
[22] ZHOU Yuemin and ZHOU Xiang. The system framework of technical standards for common applications in China high-resolution earth observation system[J]. Journal of Geo-Information Science, 2018, 20(9): 1298–1305. doi: 10.12082/dqxxkx.2018.180144.
[23] LI Ke, WAN Gang, CHENG Gong, et al. Object detection in optical remote sensing images: A survey and a new benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2020, 159: 296–307. doi: 10.1016/j.isprsjprs.2019.11.023.
[24] DIAN Yuanyong, FANG Shenghui, and YAO Chonghuai. Change detection for high-resolution images using multilevel segment method[J]. Journal of Remote Sensing, 2016, 20(1): 129–137. doi: 10.11834/jrs.20165074.
[25] ZHU Xiaoxiang, TUZEL O, MOU Lichao, et al. Deep learning in remote sensing: A comprehensive review and list of resources[J]. IEEE Geoscience and Remote Sensing Magazine, 2017, 5(4): 8–36. doi: 10.1109/MGRS.2017.2762307.
[26] YUAN Xiang, CHENG Gong, LI Ge, et al. Progress in small object detection for remote sensing images[J]. Journal of Image and Graphics, 2023, 28(6): 1662–1684. doi: 10.11834/jig.221202.
[27] ZHAO Beigeng. A systematic survey of remote sensing image captioning[J]. IEEE Access, 2021, 9: 154086–154111. doi: 10.1109/ACCESS.2021.3128140.
[28] LI Xiang, WEN Congcong, HU Yuan, et al. Vision-language models in remote sensing: Current progress and future trends[J]. IEEE Geoscience and Remote Sensing Magazine, 2024, 12(2): 32–66. doi: 10.1109/MGRS.2024.3383473.
[29] QU Bo, LI Xuelong, TAO Dacheng, et al. Deep semantic understanding of high resolution remote sensing image[C]. Proceedings of International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 2016: 1–5. doi: 10.1109/CITS.2016.7546397.
[30] YANG Yi and NEWSAM S. Bag-of-visual-words and spatial extensions for land-use classification[C]. Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, USA, 2010: 270–279. doi: 10.1145/1869790.1869829.
[31] LU Xiaoqiang, WANG Binqiang, ZHENG Xiangtao, et al. Exploring models and data for remote sensing image caption generation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2018, 56(4): 2183–2195. doi: 10.1109/TGRS.2017.2776321.
[32] YUAN Zhiqiang, ZHANG Wenkai, FU Kun, et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4404119. doi: 10.1109/TGRS.2021.3078451.
[33] CHENG Gong, HAN Junwei, and LU Xiaoqiang. Remote sensing image scene classification: Benchmark and state of the art[J]. Proceedings of the IEEE, 2017, 105(10): 1865–1883. doi: 10.1109/JPROC.2017.2675998.
[34] CHENG Qimin, HUANG Haiyan, XU Yuan, et al. NWPU-captions dataset and MLCA-Net for remote sensing image captioning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 5629419. doi: 10.1109/TGRS.2022.3201474.
[35] HU Yuan, YUAN Jianlong, WEN Congcong, et al. RSGPT: A remote sensing vision-language model and benchmark[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2025, 224: 272–286. doi: 10.1016/j.isprsjprs.2025.03.028.
[36] ZHANG Zilun, ZHAO Tiancheng, GUO Yulong, et al. RS5M and GeoRSCLIP: A large-scale vision-language dataset and a large vision-language model for remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5642123. doi: 10.1109/TGRS.2024.3449154.
[37] WANG Zhecheng, PRABHA R, HUANG Tianyuan, et al. SkyScript: A large and semantically diverse vision-language dataset for remote sensing[C]. Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, 2024: 5805–5813. doi: 10.1609/aaai.v38i6.28393.
[38] ZHANG Wei, CAI Miaoxin, ZHANG Tong, et al. EarthGPT: A universal multimodal large language model for multisensor image comprehension in remote sensing domain[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5917820. doi: 10.1109/TGRS.2024.3409624.
[39] XIONG Zhitong, WANG Yi, YU Weikang, et al. GeoLangBind: Unifying earth observation with agglomerative vision-language foundation models[J]. arXiv preprint arXiv: 2503.06312, 2025. doi: 10.48550/arXiv.2503.06312.
[40] LIU Chenyang, CHEN Keyan, ZHAO Rui, et al. Text2Earth: Unlocking text-driven remote sensing image generation with a global-scale dataset and a foundation model[J]. IEEE Geoscience and Remote Sensing Magazine, 2025, 13(3): 238–259. doi: 10.1109/MGRS.2025.3560455.
[41] RUI Yong, HUANG T S, and CHANG S F. Image retrieval: Current techniques, promising directions, and open issues[J]. Journal of Visual Communication and Image Representation, 1999, 10(1): 39–62. doi: 10.1006/jvci.1999.0413.
[42] DATCU M, DASCHIEL H, PELIZZARI A, et al. Information mining in remote sensing image archives: System concepts[J]. IEEE Transactions on Geoscience and Remote Sensing, 2003, 41(12): 2923–2936. doi: 10.1109/TGRS.2003.817197.
[43] MIKRIUKOV G, RAVANBAKHSH M, and DEMIR B. Unsupervised contrastive hashing for cross-modal retrieval in remote sensing[C]. Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 2022: 4463–4467. doi: 10.1109/ICASSP43922.2022.9746251.
[44] MIKRIUKOV G, RAVANBAKHSH M, and DEMIR B. Deep unsupervised contrastive hashing for large-scale cross-modal text-image retrieval in remote sensing[J]. arXiv preprint arXiv: 2201.08125, 2022. doi: 10.48550/arXiv.2201.08125.
[45] ZHENG Fuzhong, ZHANG Haisu, ZHANG Xiong, et al. Remote sensing cross-modal semantic retrieval technology and application[J]. Journal of Command and Control, 2026, 12(1): 21–30. doi: 10.20278/j.jc2.2096-0204.2023.0126.
[46] DASCHIEL H and DATCU M. Information mining in remote sensing image archives: System evaluation[J]. IEEE Transactions on Geoscience and Remote Sensing, 2005, 43(1): 188–199. doi: 10.1109/TGRS.2004.838374.
[47] SMEULDERS A W M, WORRING M, SANTINI S, et al. Content-based image retrieval at the end of the early years[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(12): 1349–1380. doi: 10.1109/34.895972.
[48] CHENG Gong and HAN Junwei. A survey on object detection in optical remote sensing images[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2016, 117: 11–28. doi: 10.1016/j.isprsjprs.2016.03.014.
[49] O'SHEA K and NASH R. An introduction to convolutional neural networks[J]. arXiv preprint arXiv: 1511.08458, 2015. doi: 10.48550/arXiv.1511.08458.
[50] SHERSTINSKY A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network[J]. Physica D: Nonlinear Phenomena, 2020, 404: 132306. doi: 10.1016/j.physd.2019.132306.
[51] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, 2017: 6000–6010.
[52] FROME A, CORRADO G S, SHLENS J, et al. DeViSE: A deep visual-semantic embedding model[C]. Proceedings of the 27th International Conference on Neural Information Processing Systems, Lake Tahoe, USA, 2013: 2121–2129.
[53] GUO Mao, YUAN Yuan, and LU Xiaoqiang. Deep cross-modal retrieval for remote sensing image and audio[C]. Proceedings of the 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Beijing, China, 2018: 1–7. doi: 10.1109/PRRS.2018.8486338.
[54] WANG Liwei, LI Yin, and LAZEBNIK S. Learning deep structure-preserving image-text embeddings[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016: 5005–5013. doi: 10.1109/CVPR.2016.541.
[55] ZHU Zicong, KANG Jian, DIAO Wenhui, et al. SIRS: Multitask joint learning for remote sensing foreground-entity image-text retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 5625615. doi: 10.1109/TGRS.2024.3402216.
[56] LI Chunyuan, GAN Zhe, YANG Zhengyuan, et al. Multimodal foundation models: From specialists to general-purpose assistants[J]. Foundations and Trends in Computer Graphics and Vision, 2024, 16(1/2): 1–214. doi: 10.1561/0600000110.
[57] WANG Wenhui, BAO Hangbo, DONG Li, et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, 2023: 19175–19186. doi: 10.1109/cvpr52729.2023.01838.
[58] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: A visual language model for few-shot learning[C]. Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, 2022: 1723.
[59] JIA Chao, YANG Yinfei, XIA Ye, et al. Scaling up visual and vision-language representation learning with noisy text supervision[C]. Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 2021: 4904–4916.
[60] WORTSMAN M, ILHARCO G, KIM J W, et al. Robust fine-tuning of zero-shot models[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 7949–7961. doi: 10.1109/CVPR52688.2022.00780.
[61] PELEG A, SINGH N D, and HEIN M. Advancing compositional awareness in CLIP with efficient fine-tuning[J]. arXiv preprint arXiv: 2505.24424, 2025. doi: 10.48550/arXiv.2505.24424.
[62] PAN Jiancheng, MA Muyuan, MA Qing, et al. PIR: Remote sensing image-text retrieval with prior instruction representation learning[J]. arXiv preprint arXiv: 2405.10160, 2024. doi: 10.48550/arXiv.2405.10160.
[63] GAO Peng, GENG Shijie, ZHANG Renrui, et al. CLIP-adapter: Better vision-language models with feature adapters[J]. International Journal of Computer Vision, 2024, 132(2): 581–595. doi: 10.1007/s11263-023-01891-x.
[64] ZHANG Renrui, ZHANG Wei, FANG Rongyao, et al. Tip-Adapter: Training-free adaption of CLIP for few-shot classification[C]. Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022: 493–510. doi: 10.1007/978-3-031-19833-5_29.
[65] ZHOU Kaiyang, YANG Jingkang, LOY C C, et al. Learning to prompt for vision-language models[J]. International Journal of Computer Vision, 2022, 130(9): 2337–2348. doi: 10.1007/s11263-022-01653-1.
[66] ZHOU Kaiyang, YANG Jingkang, LOY C C, et al. Conditional prompt learning for vision-language models[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 16795–16804. doi: 10.1109/CVPR52688.2022.01631.
[67] WANG Yijing, TANG Xu, HAN Shuo, et al. Remote sensing cross-modal image-text retrieval: Key technologies and challenges[J]. National Remote Sensing Bulletin, 2026, 30(2): 262–278. doi: 10.11834/jrs.20255437.
[68] FU Kun, LU Wanxuan, LIU Xiaoyu, et al. A comprehensive survey and assumption of remote sensing foundation modal[J]. National Remote Sensing Bulletin, 2024, 28(7): 1667–1680. doi: 10.11834/jrs.20233313.
[69] LIU Jiaqi, FU Ronghao, SUN Lang, et al. SkyMoE: A vision-language foundation model for enhancing geospatial interpretation with mixture of experts[C]. Proceedings of the 40th AAAI Conference on Artificial Intelligence, Singapore, Singapore, 2026: 7168–7178. doi: 10.1609/aaai.v40i9.37653.
[70] WANG Peijin, HU Huiyang, TONG Boyuan, et al. RingmoGPT: A unified remote sensing foundation model for vision, language, and grounded tasks[J]. IEEE Transactions on Geoscience and Remote Sensing, 2025, 63: 5611320. doi: 10.1109/TGRS.2024.3510833.
[71] HU Huiyang, WANG Peijin, FENG Yingchao, et al. RingMo-Agent: A unified remote sensing foundation model for multi-platform and multi-modal reasoning[J]. arXiv preprint arXiv: 2507.20776, 2025. doi: 10.48550/arXiv.2507.20776.
[72] BI Hanbo, FENG Yingchao, TONG Boyuan, et al. RingMoE: Mixture-of-modality-experts multi-modal foundation models for universal remote sensing image interpretation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026, 48(4): 4388–4405. doi: 10.1109/TPAMI.2025.3643453.
[73] ZHANG Zilun, SHEN Haozhan, ZHAO Tiancheng, et al. GeoRSMLLM: A multimodal large language model for vision-language tasks in geoscience and remote sensing[J]. arXiv preprint arXiv: 2503.12490, 2025. doi: 10.48550/arXiv.2503.12490.
[74] WANG Junjue, XUAN Weihao, QI Heli, et al. DisasterM3: A remote sensing vision-language dataset for disaster damage assessment and response[C]. Proceedings of the 39th International Conference on Neural Information Processing Systems, San Diego, USA, 2025.
[75] ZHANG Shuaihao and PAN Zhigang. Remote sensing large models: Review and future prospects[J]. Remote Sensing Technology and Application, 2025, 40(1): 1–13. doi: 10.11873/j.issn.1004-0323.2025.1.0001.
[76] LI L H, YATSKAR M, YIN D, et al. VisualBERT: A simple and performant baseline for vision and language[J]. arXiv preprint arXiv: 1908.03557, 2019. doi: 10.48550/arXiv.1908.03557.
[77] CHEN Y C, LI Linjie, YU Licheng, et al. UNITER: UNiversal image-TExt representation learning[C]. Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 2020: 104–120. doi: 10.1007/978-3-030-58577-8_7.
[78] WANG Peng, YANG An, MEN Rui, et al. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework[C]. Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, 2022: 23318–23340.
[79] LI Junnan, SELVARAJU R R, GOTMARE A D, et al. Align before fuse: Vision and language representation learning with momentum distillation[C]. Proceedings of the 35th International Conference on Neural Information Processing Systems, Virtual Event, 2021: 742.
[80] LI Junnan, LI Dongxu, XIONG Caiming, et al. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation[C]. Proceedings of the 39th International Conference on Machine Learning, Baltimore, USA, 2022: 12888–12900.
[81] LIU Haotian, LI Chunyuan, WU Qingyang, et al. Visual instruction tuning[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 1516.
[82] DAI Wenliang, LI Junnan, LI Dongxu, et al. InstructBLIP: Towards general-purpose vision-language models with instruction tuning[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 2142.
[83] BAI Shuai, CHEN Keqin, LIU Xuejing, et al. Qwen2.5-VL technical report[J]. arXiv preprint arXiv: 2502.13923, 2025. doi: 10.48550/arXiv.2502.13923.
[84] YANG Zhengyuan, LI Linjie, LIN K, et al. The dawn of LMMs: Preliminary explorations with GPT-4V(ision)[J]. arXiv preprint arXiv: 2309.17421, 2023. doi: 10.48550/arXiv.2309.17421.
[85] Gemini Team Google. Gemini: A family of highly capable multimodal models[J]. arXiv preprint arXiv: 2312.11805, 2023. doi: 10.48550/arXiv.2312.11805.
[86] CHEN Xi, WANG Xiao, CHANGPINYO S, et al. PaLI: A jointly-scaled multilingual language-image model[C]. Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.
[87] PENG Zhiliang, WANG Wenhui, DONG Li, et al. Kosmos-2: Grounding multimodal large language models to the world[J]. arXiv preprint arXiv: 2306.14824, 2023. doi: 10.48550/arXiv.2306.14824.
[88] KUCKREJA K, DANISH M S, NASEER M, et al. GeoChat: Grounded large vision-language model for remote sensing[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 27831–27840. doi: 10.1109/CVPR52733.2024.02629.
[89] SMITH M J, FLEMING L, and GEACH J E. EarthPT: A time series foundation model for earth observation[J]. arXiv preprint arXiv: 2309.07207, 2024. doi: 10.48550/arXiv.2309.07207.
[90] WANG Yi, ALBRECHT C M, and ZHU Xiaoxiang. Self-supervised vision transformers for joint SAR-optical representation learning[C]. Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022: 139–142. doi: 10.1109/IGARSS46834.2022.9883983.
[91] GUO Xin, LAO Jiangwei, DANG Bo, et al. SkySense: A multi-modal remote sensing foundation model towards universal interpretation for earth observation imagery[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 27662–27673. doi: 10.1109/CVPR52733.2024.02613.
[92] SUN Xian, WANG Pengjing, LU Wanxuan, et al. RingMo: A remote sensing foundation model with masked image modeling[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5612822. doi: 10.1109/TGRS.2022.3194732.
[93] CHEN Keyan, LIU Chenyang, CHEN Hao, et al. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model[J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 4701117. doi: 10.1109/TGRS.2024.3356074.
[94] HONG Danfeng, ZHANG Bing, LI Xuyang, et al. SpectralGPT: Spectral remote sensing foundation model[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(8): 5227–5244. doi: 10.1109/TPAMI.2024.3362475.
[95] YUAN Yuan, ZHAN Yang, and XIONG Zhitong. Parameter-efficient transfer learning for remote sensing image-text retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 5619014. doi: 10.1109/TGRS.2023.3308969.
[96] GENG Shijie, YUAN Jianbo, TIAN Yu, et al. HiCLIP: Contrastive language-image pretraining with hierarchy-aware attention[C]. Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023.
[97] SUNG Y L, CHO J, and BANSAL M. VL-ADAPTER: Parameter-efficient transfer learning for vision-and-language tasks[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, 2022: 5217–5227. doi: 10.1109/CVPR52688.2022.00516.
[98] LU Haoyu, HUO Yuqi, YANG Guoxing, et al. UniAdapter: Unified parameter-efficient transfer learning for cross-modal modeling[C]. Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024.
[99] HINTON G, VINYALS O, and DEAN J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv: 1503.02531, 2015. doi: 10.48550/arXiv.1503.02531.
[100] LERNER P, FERRET O, and GUINAUDEAU C. Cross-modal retrieval for knowledge-based visual question answering[C]. Proceedings of the 46th European Conference on Information Retrieval on Advances in Information Retrieval, Glasgow, UK, 2024: 421–438.
[101] CEPEDA V V, NAYAK G K, and SHAH M. GeoCLIP: Clip-inspired alignment between locations and images for effective worldwide geo-localization[C]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, 2023: 379.
[102] WEN Congcong, LIN Yiting, QU Xiaokang, et al. Remote sensing retrieval-augmented generation: Bridging remote sensing imagery and comprehensive knowledge with a multimodal dataset and retrieval-augmented generation model[J]. IEEE Geoscience and Remote Sensing Magazine, 2026, 14(2): 85–103. doi: 10.1109/MGRS.2025.3645852.
[103] WEI Chen, ZHANG Yiwei, GUAN Xi, et al. GeoAI for driving risk assessment via vision-language models: A knowledge guided RAG system and dual evaluation[C]. Proceedings of the 8th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery (GeoAI '25), Minneapolis, USA, 2025: 161–173. doi: 10.1145/3764912.3770832.
[104] XU Wenjia, YU Zijian, MU Boyang, et al. RS-Agent: Automating remote sensing tasks through intelligent agents[J]. arXiv preprint arXiv: 2406.07089, 2024. doi: 10.48550/arXiv.2406.07089.
[105] SHAO Run, LI Ziyu, ZHANG Zhaoyang, et al. Asking like Socrates: Socrates helps VLMs understand remote sensing images[J]. arXiv preprint arXiv: 2511.22396, 2025. doi: 10.48550/arXiv.2511.22396.
[106] RUSAK E, REIZINGER P, JUHOS A, et al. InfoNCE: Identifying the gap between theory and practice[C]. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics, Mai Khao, Thailand, 2025: 4159–4167.
[107] WANG Jianren, FANG Zhaoyuan, and ZHAO Hang. AlignNet: A unifying approach to audio-visual alignment[C]. Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, USA, 2020: 3298–3306. doi: 10.1109/WACV45572.2020.9093345.
[108] AKSAN E and HILLIGES O. STCN: Stochastic temporal convolutional networks[C]. Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
[109] JAYASUMANA S, RAMALINGAM S, VEIT A, et al. Rethinking FID: Towards a better evaluation metric for image generation[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2024: 9307–9315. doi: 10.1109/CVPR52733.2024.00889.
[110] ZHANG Yan, JI Zhong, MENG Changxu, et al. iEBAKER: Improved remote sensing image-text retrieval framework via eliminate before align and keyword explicit reasoning[J]. Expert Systems with Applications, 2026, 296: 128968. doi: 10.1016/j.eswa.2025.128968.
-