A Cybersecurity Entity Recognition Approach Based on Character Representation Learning and Temporal Boundary Diffusion

HU Ze; LI Wenjun; YANG Hongyu

doi:10.11999/JEIT240953

Volume 47 Issue 5

May 2025

Citation: HU Ze, LI Wenjun, YANG Hongyu. A Cybersecurity Entity Recognition Approach Based on Character Representation Learning and Temporal Boundary Diffusion[J]. Journal of Electronics & Information Technology, 2025, 47(5): 1554-1568. doi: 10.11999/JEIT240953

doi: 10.11999/JEIT240953 cstr: 32379.14.JEIT240953

HU Ze¹, LI Wenjun¹, YANG Hongyu¹,²

1. School of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300, China
2. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China

Funds: The National Natural Science Foundation of China (62201576, U1833107), The Supporting Fund of the National Natural Science Foundation of China (3122023PT10)

Received Date: 2024-10-28
Rev Recd Date: 2025-02-10

Available Online: 2025-02-19

Publish Date: 2025-05-01

Abstract

Objective The vast amount of unstructured cybersecurity information available online holds significant value. Named Entity Recognition (NER) in cybersecurity facilitates the automatic extraction of such information, providing a foundation for cyber threat analysis and knowledge graph construction. However, existing cybersecurity NER research remains limited, primarily relying on general-purpose approaches that struggle to generalize effectively to domain-specific datasets, often resulting in errors when recognizing cybersecurity-specific terms. Some recent studies decompose the NER task into entity boundary detection and entity classification, optimizing these subtasks separately to enhance performance. However, the representation of complex cybersecurity entities often exceeds the capability of single-feature semantic representations, and existing boundary detection methods frequently produce misjudgments. To address these challenges, this study proposes a cybersecurity entity recognition approach based on character representation learning and temporal boundary diffusion. The approach integrates character-level feature extraction with a boundary diffusion network based on a denoising diffusion probabilistic model. By focusing on optimizing entity boundary detection, the proposed method improves performance in cybersecurity NER tasks.

Methods The proposed approach divides the NER task into two subtasks, entity boundary detection and entity classification, which are processed independently, as illustrated in Fig. 1. For entity boundary detection, a Question-Answering (QA) framework is adopted. The framework first generates questions about the entities to be extracted, concatenates them with the corresponding input sentences, and encodes them using a pre-trained BERT model to extract preliminary semantic features.
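The question-sentence concatenation described above can be illustrated with a minimal sketch. It assumes BERT's standard sentence-pair layout (`[CLS] question [SEP] sentence [SEP]` with segment ids 0 and 1); the function name `build_qa_input`, the question wording, and the example sentence are hypothetical, and whitespace splitting stands in for the WordPiece tokenizer a real BERT pipeline would use. It is not the authors' implementation.

```python
from typing import List, Tuple


def build_qa_input(question: str, sentence: str) -> Tuple[List[str], List[int]]:
    """Concatenate a question and a sentence in BERT's sentence-pair layout.

    Assumption: whitespace splitting replaces WordPiece tokenization so the
    sketch stays dependency-free.
    """
    q_tokens = question.split()
    s_tokens = sentence.split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + s_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the sentence and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(s_tokens) + 1)
    return tokens, segment_ids


# Hypothetical question and sentence, used only for illustration.
tokens, segment_ids = build_qa_input(
    "Which spans in the text are entities ?",
    "The attacker group delivered malware through a phishing email",
)
for token, segment in zip(tokens, segment_ids):
    print(segment, token)
```

An encoder would then map each token to a contextual vector; boundary detection only needs the vectors belonging to segment 1, since those correspond to the input sentence.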
Character-level feature extraction is then performed using a Dilated Convolutional Residual Character Network (DCR-CharNet), which processes character-level information through dilated residual blocks. Dilated convolution expands the model’s receptive field, capturing broader contextual information, while a self-attention mechanism dynamically identifies key features. These components enhance the global representation of input data and provide multi-dimensional feature representations. A Temporal Boundary Diffusion Network (TBDN) is then applied for entity boundary detection. TBDN employs a fixed forward diffusion process that introduces Gaussian noise to entity boundaries at each time step, progressively blurring them. A learnable reverse diffusion process subsequently predicts and removes noise at each time step, enabling the gradual recovery of accurate entity boundaries and leading to precise boundary detection. For entity classification, an independent network is trained to assign labels to detected entities. Like boundary detection, this subtask also adopts a QA framework. A cybersecurity-specific pre-trained language model, SecRoBERTa, encodes the concatenated question and input data to extract entity classification features. These features are then processed through a linear-layer-based entity classifier, which outputs the recognized entity type.

Results and Discussions The performance of the proposed approach is evaluated on the DNRTI cybersecurity dataset, with comparative results against baseline methods presented in Table 3. The proposed approach achieved a 0.40% improvement in F1-score over UTERMMF, a model incorporating character-level, part-of-speech, and positional features along with inter-word relationship classification. Compared to CTERMRFRAT, which employs an adversarial training framework, the proposed approach improved the F1-score by 1.65%.
Additionally, it outperformed BERT+BiLSTM+CRF by 5.20% and achieved gains of 12.21%, 17.90%, and 18.31% over BERT, CNN+BiLSTM+CRF, and IDCNN+CRF, respectively. These results highlight that boundary detection accuracy is a key factor limiting NER performance, and optimizing boundary detection methods can significantly enhance overall model effectiveness. The proposed approach’s emphasis on boundary detection enables more accurate identification of entity boundaries, contributing to higher F1-scores. However, in terms of accuracy, it slightly underperforms CNN+BiLSTM+CRF. This discrepancy is attributed to class imbalance in the dataset, where certain categories are overrepresented while others are underrepresented. The approach demonstrates strong performance in handling minority categories, but its focus on rare entities slightly reduces prediction accuracy for common categories, affecting overall accuracy. Despite this trade-off, the approach enhances entity boundary detection, reducing misidentifications and improving precision and recall, thereby increasing the F1-score. Errors in boundary detection may propagate to the entity classification stage, impacting overall accuracy. However, the proposed two-stage approach, which prioritizes boundary detection optimization, ensures more precise boundary identification, which is crucial for improving NER performance. In terms of computational efficiency, the proposed approach is compared with DiffusionNER (Table 4), another diffusion-based NER model. Results indicate that the proposed approach requires fewer parameters, achieves faster inference speeds, and delivers higher F1-scores under the same hardware and software conditions.

Conclusions Enhancing boundary detection efficiency significantly improves NER performance. The proposed approach reduces resource consumption while achieving superior performance compared to recent baseline methods in cybersecurity NER tasks.
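The fixed forward diffusion process mentioned in the Methods can be sketched with the standard closed form from denoising diffusion probabilistic models, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard Gaussian. The sketch below makes several assumptions that the abstract does not state: boundaries are (start, end) positions normalised to [0, 1], the noise schedule is linear, and all function names and hyperparameter values are hypothetical. The learnable reverse process, which is the part TBDN actually trains, is omitted.

```python
import math
import random
from typing import List, Tuple


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Per-step noise variances rising linearly (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """Running product of (1 - beta_t), i.e. alpha-bar_t for each step."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def diffuse_boundary(span: Tuple[float, float], t: int,
                     alpha_bars: List[float],
                     rng: random.Random) -> Tuple[float, ...]:
    """Sample a noisy boundary x_t from q(x_t | x_0); t is a 0-indexed step."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return tuple(signal * x + noise * rng.gauss(0.0, 1.0) for x in span)


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)
clean_span = (0.25, 0.50)  # hypothetical normalised (start, end) boundary

for t in (0, 499, 999):
    noisy = diffuse_boundary(clean_span, t, alpha_bars, rng)
    print(t, [round(value, 3) for value in noisy])
```

At early steps the sampled boundary stays close to the clean one, while at the final step ᾱ_t is close to zero and the sample is almost pure Gaussian noise. A trained denoiser would run this chain in reverse to recover the boundary.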

Keywords:
- Named Entity Recognition (NER)
- Cybersecurity
- Boundary detection
- Deep learning
- Natural language processing
