AutoPenGPT: Drift-Resistant Penetration Testing Driven by Search-Space Convergence and Dependency Modeling

HUANG Weigang; FU Lirong; LIU Peiyu; DU Linkang; YE Tong; XIA Yifan; WANG Wenhai

doi:10.11999/JEIT250873


HUANG Weigang, FU Lirong, LIU Peiyu, DU Linkang, YE Tong, XIA Yifan, WANG Wenhai. AutoPenGPT: Drift-Resistant Penetration Testing Driven by Search-Space Convergence and Dependency Modeling[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250873



cstr: 32379.14.JEIT250873

1. Zhejiang University, Hangzhou 310000, China
2. Hangzhou Dianzi University, Hangzhou 310000, China
3. Xi’an Jiaotong University, Xi’an 710000, China

Funds: The National Natural Science Foundation of China (62302443), The Postdoctoral Fellowship Program and China Postdoctoral Science Foundation (BX20230307), The Fundamental Research Funds for the Central Universities (2025ZFJH02)

Received Date: 2025-09-04
Accepted Date: 2025-12-31
Rev Recd Date: 2025-12-31

Available Online: 2026-01-15

Abstract


Objective: Industrial Control Systems (ICS) are widely deployed in critical sectors and often retain long-standing vulnerabilities, because strict availability requirements leave few opportunities for patching. The growing exposure of external management and access infrastructure has expanded the attack surface, allowing adversaries to pivot from boundary components into fragile production networks. Continuous penetration testing of these components is essential, but it remains costly and difficult to scale when performed manually. Recent work has explored Large Language Models (LLMs) for automated penetration testing; however, existing systems often suffer from strategy drift and intention drift, which lead to incoherent testing behavior and ineffective exploitation chains.

Methods: This study proposes AutoPenGPT, a multi-agent framework for automated Web security testing. AutoPenGPT uses an adaptive exploration-space convergence mechanism that predicts likely vulnerability types from target semantics and constrains LLM-driven testing through a dynamically updated payload knowledge base. To reduce intention drift in multi-step exploitation, a dependency-driven strategy module rewrites historical feedback, models the dependencies between steps, and generates coherent, executable strategies in a closed-loop workflow. A semi-structured prompt embedding scheme is also developed to support heterogeneous penetration testing tasks while preserving semantic integrity.

Results and Discussions: AutoPenGPT is evaluated on Capture-the-Flag (CTF) benchmarks and on real-world ICS and Web platforms. On the CTF datasets, it achieves 97.62% vulnerability-type detection accuracy and an 80.95% requirement completion rate, substantially outperforming state-of-the-art tools. In real-world deployments, it reaches a requirement completion rate of approximately 70% and identifies six previously undisclosed vulnerabilities, demonstrating its practical effectiveness.

Conclusions: The contributions are threefold.
(1) Strategy drift and intention drift in LLM-driven penetration testing are examined and addressed through adaptive exploration and dependency-aware strategy mechanisms that stabilize long-horizon testing behavior.
(2) AutoPenGPT is designed and implemented as a multi-agent penetration testing system that integrates semantic vulnerability prediction, closed-loop strategy generation, and semi-structured prompt embedding.
(3) Extensive evaluation on CTF benchmarks and real-world ICS and Web platforms confirms the effectiveness and practicality of the system, including the discovery of previously unknown vulnerabilities.
Keywords:
- Large language models
- Intelligent penetration testing
- Web applications
- Industrial control system security
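The two core mechanisms named in the Methods paragraph, exploration-space convergence over a payload knowledge base and dependency modeling of multi-step strategies, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `{type: score}` prediction format, the category-keyed knowledge base, the top-k/threshold pruning rule, the promote/demote feedback update, and the use of a topological sort to order dependent steps are all hypothetical choices made for illustration. The functions `converge_search_space`, `record_feedback`, and `order_steps` are invented names, and the vulnerability-type labels and payload entries are placeholders.

```python
"""Hedged sketch of two ideas from the AutoPenGPT abstract (not the authors' code).

Hypothetical assumptions, not taken from the paper:
  * vulnerability-type predictions arrive as a {type_label: score} dict;
  * the payload knowledge base is a {type_label: [payload_template, ...]} dict;
  * each testing step names the steps it depends on.
Requires Python 3.9+ for graphlib.
"""
from graphlib import TopologicalSorter


def converge_search_space(type_scores, knowledge_base, top_k=2, min_score=0.2):
    """Restrict testing to payload categories of the most likely vulnerability types.

    Keeps at most `top_k` types whose predicted score is at least `min_score`
    (both thresholds are illustrative) and returns copies of their payload lists.
    """
    ranked = sorted(type_scores.items(), key=lambda item: item[1], reverse=True)
    kept = [label for label, score in ranked[:top_k] if score >= min_score]
    return {label: list(knowledge_base.get(label, [])) for label in kept}


def record_feedback(knowledge_base, type_label, payload, succeeded):
    """Update the knowledge base in place from one test result.

    Successful payloads move to the front of their category so they are tried
    first next time; failed ones move (or are added) to the back.
    """
    entries = knowledge_base.setdefault(type_label, [])
    if payload in entries:
        entries.remove(payload)
    if succeeded:
        entries.insert(0, payload)
    else:
        entries.append(payload)


def order_steps(dependencies):
    """Order multi-step actions so each step runs after all of its prerequisites.

    `dependencies` maps step -> set of prerequisite steps. Raises
    graphlib.CycleError if the modeled dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Placeholder labels and templates only; no real payloads.
    scores = {"sqli": 0.7, "xss": 0.2, "ssrf": 0.05}
    kb = {
        "sqli": ["<sqli-template-1>", "<sqli-template-2>"],
        "xss": ["<xss-template-1>"],
        "ssrf": ["<ssrf-template-1>"],
    }
    print(converge_search_space(scores, kb))
    record_feedback(kb, "sqli", "<sqli-template-2>", succeeded=True)
    print(kb["sqli"])
    steps = {
        "login": set(),
        "enumerate_endpoints": {"login"},
        "probe_parameters": {"enumerate_endpoints"},
        "report_finding": {"probe_parameters"},
    }
    print(order_steps(steps))
```

A topological order is only one simple way to keep a multi-step plan consistent with its prerequisites. Because `TopologicalSorter.static_order()` raises `graphlib.CycleError` on circular dependencies, a closed-loop planner built along these lines could treat that error as a signal to regenerate the strategy rather than execute an incoherent sequence.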
