MCL-PhishNet: A Multi-Modal Contrastive Learning Network for Phishing URL Detection
Abstract:
Objective: As phishing attacks grow more complex and dynamic, traditional detection methods struggle against emerging attacks because of feature redundancy (inflated feature dimensionality), multi-modal mismatch, and insufficient robustness to adversarial samples.
Methods: This paper proposes MCL-PhishNet (Multi-modal Contrastive Learning Phishing Network), a multi-modal contrastive learning framework that detects phishing URLs through a hierarchical syntactic encoder, a bidirectional cross-modal attention mechanism, and a curriculum contrastive learning strategy. Specifically, multi-scale residual convolutions and a Transformer jointly model the local syntactic patterns and global dependencies of URLs, while 17-dimensional statistical features improve robustness to adversarial samples. The dynamic contrastive learning mechanism partitions semantic subspaces by online spectral clustering and applies boundary margin constraints to optimize the distribution of the feature space.
Results and Discussions: Experiments show that MCL-PhishNet achieves an accuracy of 99.41% and an F1-score of 99.65% on datasets including EBUU17 and PhishStorm (Fig. 4 and Fig. 5), significantly outperforming traditional machine learning and deep learning approaches.
Conclusions: The framework provides an end-to-end technical paradigm for detecting dynamic adversarial attacks.
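Table 1 Performance comparison of different methods on the Kaggle URL dataset (%)

| Method | Accuracy | Precision | F1-score |
|---|---|---|---|
| LR | 58.83 | 99.00 | 41.74 |
| DT | 95.41 | 95.80 | 95.91 |
| RF | 96.77 | 96.73 | 97.12 |
| NB | 88.39 | 94.92 | 88.96 |
| SVM | 71.80 | 96.34 | 65.67 |
| VAE-DNN | 97.45 | 97.02 | 96.54 |
| LR+SVC+DT | 98.12 | 97.31 | 95.89 |
| PDSMV3-DCRNN | 99.05 | 99.02 | 99.00 |
| MCL-PhishNet | 99.30 | 99.28 | 99.65 |

A.1 Composition of the 17-dimensional URL statistical features and their innovations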
| No. | Feature | Computation | Innovation note |
|---|---|---|---|
| | Traditional features (8 dimensions) | | |
| 1 | URL length | $L = len(url)$ | Existing feature |
| 2 | Domain length | $L_d = len(domain)$ | Existing feature |
| 3 | Path depth | $D_p = count(\text{'/'})$ | Existing feature |
| 4 | Number of query parameters | $N_q = count(\text{'='})$ | Existing feature |
| 5 | Number of special characters | $N_s = count(\{\text{'-'}, \text{'\_'}, \text{'@'}\})$ | Existing feature |
| 6 | Digit ratio | $R_d = count(digits) / L$ | Existing feature |
| 7 | HTTPS flag | $\mathbb{I}_{https} \in \{0, 1\}$ | Existing feature |
| 8 | Number of subdomains | $N_{sub} = count(\text{'.'}) - 1$ | Existing feature |
| | Improved features (5 dimensions) | | |
| 9 | Adaptive weighted domain entropy | $H_d^{*} = -\sum_{i=1}^{L_d} w_i \cdot p(c_i) \log p(c_i)$, with $w_i = e^{-\alpha (i-1) / L_d}$ | Innovation 1: introduces position-decay weights, so prefix characters carry more weight |
| 10 | Path semantic density | $\rho_p = \dfrac{\sum_{w \in path} \mathbb{I}(w \in V_{sens})}{N_{words}}$, sensitive-word lexicon: {login, verify, secure, account} | Innovation 2: proportion of sensitive words; $V_{sens}$ is a predefined sensitive-word lexicon |
| 11 | Parameter anomaly | $A_q = \dfrac{1}{N_q} \sum_{i=1}^{N_q} \mathbb{I}(len(v_i) > 20)$ | Innovation 3: proportion of long parameter values; detects redirect URLs hidden in parameters |
| 12 | Redirect chain depth | $D_{jump} = count(redirects)$ | Innovation 4: measured with HEAD requests; identifies dynamic redirect attacks |
| 13 | Certificate trustworthiness | $T_{cert} \in [0, 1]$, based on validity period, issuing authority, and domain match | Innovation 5: SSL certificate validity score computed across multiple dimensions |
| | New features (4 dimensions) | | |
| 14 | Character-substitution similarity | $S_{char} = \max_{b \in \mathcal{B}} sim(d, b)$, with $sim(d, b) = 1 - \dfrac{ED(d, b)}{\max(\lvert d \rvert, \lvert b \rvert)}$ | Innovation 6: designed for obfuscation attacks; uses the edit distance to known brand domains |
| 15 | Brand name match | $M_{brand} = \max_{b \in \mathcal{B}} \dfrac{LCS(d, b)}{\lvert b \rvert}$, where $\mathcal{B}$ is the Alexa Top 1000 brand domain list | Innovation 7: longest-common-subsequence ratio; detects brand impersonation |
| 16 | URL segment entropy difference | $\Delta H = \lvert H_{domain} - H_{path} \rvert$ | Innovation 8: entropy gap between domain and path; detects random-path obfuscation |
| 17 | Domain registration age | $T_{age} = T_{now} - T_{reg}$ (days) | Innovation 9: obtained by WHOIS query; newly registered domains (< 7 days) are treated as high risk |
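To make the feature definitions in the table above concrete, the following is a minimal Python sketch of the features that can be computed from the URL string alone (features 9, 10, 11, 14, 15, and 16). It is an illustration under stated assumptions, not the authors' implementation: the function names, the value `alpha = 1.0`, the base-2 logarithm, the tokenization of the path, the three-entry `BRANDS` list standing in for the Alexa Top 1000 list, and the way the domain label is extracted are all hypothetical choices that the table does not specify. Features that require network access (redirect chain depth, certificate trustworthiness, WHOIS registration age) are left out.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

SENSITIVE_WORDS = {"login", "verify", "secure", "account"}  # lexicon listed in the table
BRANDS = ["paypal", "google", "apple"]  # hypothetical stand-in for the Alexa Top 1000 list


def shannon_entropy(text):
    """Shannon entropy in bits of the character distribution of text."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((k / n) * math.log2(k / n) for k in Counter(text).values())


def weighted_domain_entropy(domain, alpha=1.0):
    """Feature 9 with w_i = exp(-alpha * (i - 1) / L_d).

    The sum runs over character positions, as written in the table.
    alpha = 1.0 and the base-2 logarithm are assumptions.
    """
    n = len(domain)
    if n == 0:
        return 0.0
    freq = Counter(domain)
    total = 0.0
    for pos, ch in enumerate(domain):  # pos is zero-based, i.e. (i - 1)
        p = freq[ch] / n
        total += math.exp(-alpha * pos / n) * p * math.log2(p)
    return -total


def path_semantic_density(path):
    """Feature 10: share of path tokens found in the sensitive-word lexicon."""
    words = [w for w in re.split(r"[^a-z0-9]+", path.lower()) if w]
    if not words:
        return 0.0
    return sum(w in SENSITIVE_WORDS for w in words) / len(words)


def parameter_anomaly(query, max_len=20):
    """Feature 11: share of query values longer than max_len characters."""
    values = [v for _, v in parse_qsl(query, keep_blank_values=True)]
    if not values:
        return 0.0
    return sum(len(v) > max_len for v in values) / len(values)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def char_similarity(name, brands=BRANDS):
    """Feature 14: best normalized edit-distance similarity to a brand name."""
    return max(1 - edit_distance(name, b) / max(len(name), len(b)) for b in brands)


def brand_match(name, brands=BRANDS):
    """Feature 15: best LCS length relative to the brand name length."""
    return max(lcs_length(name, b) / len(b) for b in brands)


def lexical_features(url, brands=BRANDS):
    """Collect the URL-only features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Crude guess at the registrable name; multi-part suffixes need a suffix list.
    name = labels[-2] if len(labels) >= 2 else host
    return {
        "weighted_domain_entropy": weighted_domain_entropy(host),
        "path_semantic_density": path_semantic_density(parts.path),
        "parameter_anomaly": parameter_anomaly(parts.query),
        "char_similarity": char_similarity(name, brands),
        "brand_match": brand_match(name, brands),
        "entropy_gap": abs(shannon_entropy(host) - shannon_entropy(parts.path)),
    }
```

Calling `lexical_features("http://paypa1.com/secure/login/index.php?id=1")` returns a dictionary with one value per feature. In this sketch the look-alike name `paypa1` differs from `paypal` by a single substitution, so both the edit-distance similarity and the LCS ratio come out at 5/6, which is the kind of near-match signal features 14 and 15 are meant to capture.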