MCL-PhishNet: A Multi-Modal Contrastive Learning Network for Phishing URL Detection

DONG Qingwei, FU Xueting, ZHANG Benkui

Citation: DONG Qingwei, FU Xueting, ZHANG Benkui. MCL-PhishNet: A Multi-Modal Contrastive Learning Network for Phishing URL Detection[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250758

doi: 10.11999/JEIT250758 cstr: 32379.14.JEIT250758

Corresponding author: ZHANG Benkui, zhangbk@aircas.ac.cn

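  • Abstract: As phishing attacks grow more complex and dynamic, traditional detection methods struggle against novel attacks: feature dimensionality is inflated, modalities are mismatched, and robustness to adversarial samples is insufficient. This paper proposes MCL-PhishNet (Multi-modal Contrastive Learning Phishing Network), a multimodal contrastive learning framework that detects phishing URLs through a hierarchical syntactic encoder, a bidirectional cross-modal attention mechanism, and a curriculum contrastive learning strategy. Multi-scale residual convolutions and a Transformer jointly model the local syntactic patterns and global dependencies of a URL, while 17-dimensional statistical features strengthen robustness to adversarial samples. A dynamic contrastive learning mechanism partitions semantic subspaces by online spectral clustering and optimizes the feature-space distribution with a boundary margin constraint. Experiments show that MCL-PhishNet achieves 99.41% accuracy and a 99.65% F1 score on datasets including EBUU17 and PhishStorm, significantly outperforming traditional machine learning and deep learning methods. The method offers an end-to-end technical paradigm for detecting dynamic adversarial attacks.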
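  • Figure 1

    Figure 2  Example curriculum-learning weight schedule curve based on the Beta distribution

    Figure 3  Examples of feature extraction for URLs of different lengths

    Figure 4  Ablation results for different modules under the evaluation metrics

    Figure 5  Comparison of the best detection performance of different models in terms of accuracy

    Figure 6  Comparison of the best detection performance of different models in terms of accuracy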

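    Table 1  Performance comparison of different methods on the Kaggle URL dataset (%)

    | Method | Accuracy | Precision | F1 score |
    |---|---|---|---|
    | LR | 58.83 | 99.00 | 41.74 |
    | DT | 95.41 | 95.80 | 95.91 |
    | RF | 96.77 | 96.73 | 97.12 |
    | NB | 88.39 | 94.92 | 88.96 |
    | SVM | 71.80 | 96.34 | 65.67 |
    | VAE-DNN | 97.45 | 97.02 | 96.54 |
    | LR+SVC+DT | 98.12 | 97.31 | 95.89 |
    | PDSMV3-DCRNN | 99.05 | 99.02 | 99.00 |
    | MCL-PhishNet | 99.30 | 99.28 | 99.65 |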
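Table 1 reports precision and F1 but not recall, which makes the LR and SVM rows (very high precision, low F1) hard to read at a glance. The sketch below recovers the implied recall under the assumption that the reported F1 is the standard harmonic mean of precision and recall, F1 = 2PR / (P + R); the page itself does not state the definition used, and the function name is illustrative only.

```python
def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R. Inputs and output are percentages."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("F1 must be smaller than 2 * precision")
    return f1 * precision / denom


# Precision and F1 (%) copied from the LR and SVM rows of Table 1.
rows = [("LR", 99.00, 41.74), ("SVM", 96.34, 65.67)]
for name, p, f1 in rows:
    r = recall_from_precision_f1(p, f1)
    print(f"{name}: implied recall = {r:.2f}%")
```

If the assumption holds, LR and SVM reach an implied recall of only about 26.4% and 49.8% respectively, so their low F1 scores come from missed phishing URLs rather than from false alarms.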

    Table A.1  Composition of the 17-dimensional URL statistical features and notes on their novelty

    | No. | Feature | Computation | Novelty |
    |---|---|---|---|
    | | Traditional features (8 dimensions) | | |
    | 1 | URL length | $L = \mathrm{len}(url)$ | Existing feature |
    | 2 | Domain length | $L_d = \mathrm{len}(domain)$ | Existing feature |
    | 3 | Path depth | $D_p = \mathrm{count}(\text{'/'})$ | Existing feature |
    | 4 | Number of query parameters | $N_q = \mathrm{count}(\text{'='})$ | Existing feature |
    | 5 | Number of special characters | $N_s = \mathrm{count}(\{\text{'-'}, \text{'\_'}, \text{'@'}\})$ | Existing feature |
    | 6 | Digit ratio | $R_d = \mathrm{count}(digits) / L$ | Existing feature |
    | 7 | HTTPS flag | $\mathbb{I}_{https} \in \{0, 1\}$ | Existing feature |
    | 8 | Number of subdomains | $N_{sub} = \mathrm{count}(\text{'.'}) - 1$ | Existing feature |
    | | Improved features (5 dimensions) | | |
    | 9 | Adaptive weighted domain entropy | $H_d^{*} = -\sum_{i=1}^{L_d} w_i \cdot p(c_i) \log p(c_i)$, with $w_i = e^{-\alpha (i-1)/L_d}$ | Innovation 1: position-decay weights give prefix characters a higher weight |
    | 10 | Path semantic density | $\rho_p = \dfrac{\sum_{w \in path} \mathbb{I}(w \in V_{sens})}{N_{words}}$; sensitive-word lexicon: {login, verify, secure, account} | Innovation 2: proportion of sensitive words; $V_{sens}$ is a predefined sensitive-word lexicon |
    | 11 | Parameter anomaly | $A_q = \dfrac{1}{N_q} \sum_{i=1}^{N_q} \mathbb{I}(\mathrm{len}(v_i) > 20)$ | Innovation 3: proportion of long parameter values, used to detect hidden redirect URLs |
    | 12 | Redirect chain depth | $D_{jump} = \mathrm{count}(redirects)$ | Innovation 4: measured with HEAD requests to identify dynamic redirect attacks |
    | 13 | Certificate trust | $T_{cert} \in [0, 1]$, based on validity period, issuing authority, and domain match | Innovation 5: SSL certificate validity score computed over multiple dimensions |
    | | New features (4 dimensions) | | |
    | 14 | Character substitution similarity | $S_{char} = \max_{b \in \mathcal{B}} sim(d, b)$, with $sim(d, b) = 1 - \dfrac{ED(d, b)}{\max(\lvert d \rvert, \lvert b \rvert)}$ | Innovation 6: designed against obfuscation attacks, using the edit distance to known brand domains |
    | 15 | Brand name match | $M_{brand} = \max_{b \in \mathcal{B}} \dfrac{LCS(d, b)}{\lvert b \rvert}$; $\mathcal{B}$: Alexa Top 1000 brand domain list | Innovation 7: longest-common-subsequence ratio detects brand impersonation |
    | 16 | URL segment entropy difference | $\Delta H = \lvert H_{domain} - H_{path} \rvert$ | Innovation 8: the entropy gap between domain and path detects random-path obfuscation |
    | 17 | Domain registration age | $T_{age} = T_{now} - T_{reg}$ (days) | Innovation 9: obtained by WHOIS query; newly registered domains (< 7 days) are high risk |
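The improved and new features in the table above that need no network access (features 9 to 11 and 14 to 16) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the value of `alpha`, the natural logarithm, the use of character frequency within the string as $p(c_i)$, the path tokenizer, and the two-entry `BRANDS` list (a stand-in for the Alexa Top 1000 list) are all assumptions, and every function name is hypothetical. Features 12, 13, and 17 depend on HEAD requests, certificate inspection, and WHOIS lookups and are left out.

```python
import math
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

SENSITIVE_WORDS = {"login", "verify", "secure", "account"}  # lexicon from the table
BRANDS = ["paypal.com", "google.com"]  # assumption: stand-in for the brand list


def shannon_entropy(text: str) -> float:
    """Plain character entropy (natural log), used for feature 16."""
    n = len(text)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in Counter(text).values())


def weighted_domain_entropy(domain: str, alpha: float = 1.0) -> float:
    """Feature 9: -sum_i w_i p(c_i) log p(c_i), w_i = exp(-alpha (i-1) / L_d)."""
    n = len(domain)
    if n == 0:
        return 0.0
    freq = Counter(domain)
    total = 0.0
    for i, ch in enumerate(domain):  # 0-based i plays the role of (i - 1)
        p = freq[ch] / n
        total -= math.exp(-alpha * i / n) * p * math.log(p)
    return total


def path_semantic_density(path: str) -> float:
    """Feature 10: share of path tokens found in the sensitive-word lexicon."""
    words = [w for w in re.split(r"[^a-z0-9]+", path.lower()) if w]
    if not words:
        return 0.0
    return sum(w in SENSITIVE_WORDS for w in words) / len(words)


def parameter_anomaly(query: str, threshold: int = 20) -> float:
    """Feature 11: share of query values longer than `threshold` characters."""
    values = [v for _, v in parse_qsl(query, keep_blank_values=True)]
    if not values:
        return 0.0
    return sum(len(v) > threshold for v in values) / len(values)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def char_substitution_similarity(domain: str, brands=BRANDS) -> float:
    """Feature 14: max over brands of 1 - ED(d, b) / max(|d|, |b|)."""
    return max(1 - edit_distance(domain, b) / max(len(domain), len(b)) for b in brands)


def brand_match(domain: str, brands=BRANDS) -> float:
    """Feature 15: max over brands of LCS(d, b) / |b|."""
    return max(lcs_length(domain, b) / len(b) for b in brands)


def offline_features(url: str, alpha: float = 1.0) -> dict:
    """Collect features 9-11 and 14-16 for a single URL."""
    parts = urlsplit(url)
    domain = parts.hostname or ""
    return {
        "weighted_domain_entropy": weighted_domain_entropy(domain, alpha),
        "path_semantic_density": path_semantic_density(parts.path),
        "parameter_anomaly": parameter_anomaly(parts.query),
        "char_substitution_similarity": char_substitution_similarity(domain),
        "brand_match": brand_match(domain),
        "entropy_difference": abs(shannon_entropy(domain) - shannon_entropy(parts.path)),
    }
```

For a look-alike URL such as `http://paypa1.com/secure/login?token=<25 characters>&id=7`, both brand features evaluate to 0.9 against `paypal.com`, the path semantic density is 1.0, and the parameter anomaly is 0.5, which is the kind of signal these features are meant to surface.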
Publication history
  • Revised: 2025-12-03
  • Accepted: 2025-12-03
  • Published online: 2025-12-09
