LLM4CodeSec: A Framework for Evaluating the Effectiveness of Large Language Models in Source Code Vulnerability Detection

Kirill Gladkikh, Alexander Zakharov, Irina Zakharova

Abstract


In the context of a sustained increase in software vulnerabilities and the growing adoption of large language models (LLMs) for source code analysis, objective evaluation of their effectiveness in vulnerability detection tasks is becoming increasingly important. Despite a substantial body of research in this area, most existing studies focus on specific programming languages, limited datasets, or proprietary models, which hinders reproducibility and comparability. This paper presents LLM4CodeSec, a framework for the comprehensive evaluation of large language models in source code vulnerability detection tasks. The framework is implemented in Python and provides a unified infrastructure for reproducible experiments with various language models, datasets, prompting strategies, and evaluation metrics. Its architecture is based on object-oriented design principles and ensures extensibility without modification of the system core. The framework supports binary and multiclass classification, classification by Common Weakness Enumeration (CWE) types, and risk-related metrics, including false negative rate and inference time. The functionality of the proposed solution is validated through experimental evaluation on several widely used source code vulnerability benchmarks. The results demonstrate the applicability of the framework to both research and practical software security analysis tasks, including integration within Continuous Integration (CI) pipelines. The source code of the framework is available at: https://github.com/vodkar/llm4codesec-framework.
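The abstract highlights risk-related metrics such as false negative rate and inference time for binary vulnerability classification. A minimal sketch of how such metrics might be computed is shown below; all names (`false_negative_rate`, `timed_predict`, `model.predict`) are hypothetical illustrations and are not taken from the LLM4CodeSec API.

```python
# Illustrative sketch only: risk-related metrics for a binary vulnerability
# classifier (1 = vulnerable, 0 = safe). Names are hypothetical and do not
# reflect the actual LLM4CodeSec interfaces.
import time

def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): share of vulnerable samples the model missed."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fn / (fn + tp) if (fn + tp) else 0.0

def timed_predict(model, samples):
    """Run model.predict on each sample and record per-sample inference time."""
    preds, times = [], []
    for s in samples:
        start = time.perf_counter()
        preds.append(model.predict(s))
        times.append(time.perf_counter() - start)
    return preds, times

# Example: ground truth vs. predictions for five code samples.
y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(false_negative_rate(y_true, y_pred))  # → 0.3333333333333333
```

In a security setting, FNR is often weighted more heavily than overall accuracy, since a missed vulnerability (false negative) is typically costlier than a false alarm.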



References


A. A. Zakharov and K. I. Gladkikh, Characteristics and trends of zero-day vulnerabilities in open-source code. International Russian Automation Conference, 2024, pp. 498–502. doi: 10.1109/rusautocon61949.2024.10694228.

The MITRE Corporation, CVE, “Metrics” [Online]. Available: https://www.cve.org/about/Metrics.

J. Leinonen et al., “Comparing code explanations created by students and large language models,” ITiCSE 2023: Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, pp. 124–130, Jun. 2023, doi: 10.1145/3587102.3588785.

B. Berabi, A. Gronskiy, V. Raychev, G. Sivanrupan, V. Chibotaru, and M. Vechev, “DeepCode AI Fix: fixing security vulnerabilities with large language models,” arXiv.org, Feb. 19, 2024. https://arxiv.org/abs/2402.13291

H. Y. Lin, C. Liu, H. Gao, P. Thongtanunam, and C. Treude, “CodeReviewQA: the code review comprehension assessment for large language models,” arXiv.org, Mar. 20, 2025. https://arxiv.org/abs/2503.16167

Y. Yu et al., “Fine-tuning large language models to improve accuracy and comprehensibility of automated code review,” ACM Transactions on Software Engineering and Methodology, vol. 34, no. 1, pp. 1–26, Sep. 2024, doi: 10.1145/3695993.

D. Namiot, “What LLM knows about cybersecurity,” International Journal of Open Information Technologies, vol. 13, no. 7, pp. 37–46, 2025 (in Russian).

H. Xu et al., “Large language models for cyber security: a systematic literature review,” ACM Transactions on Software Engineering and Methodology, Sep. 2025, doi: 10.1145/3769676.

Z. Li et al., “VulDeePecker: a deep learning-based system for vulnerability detection,” Internet Society, Jan. 2018, doi: 10.14722/ndss.2018.23158.

C. Thapa, S. I. Jang, M. E. Ahmed, S. Camtepe, J. Pieprzyk, and S. Nepal, “Transformer-based language models for software vulnerability detection,” Proceedings of the 38th Annual Computer Security Applications Conference (ACSAC ’22), pp. 481–496, Dec. 2022, doi: 10.1145/3564625.3567985.

N. Ziems and S. Wu, “Security vulnerability detection using deep learning natural language processing,” IEEE Conference on Computer Communications Workshops, pp. 1–6, May 2021, doi: 10.1109/infocomwkshps51825.2021.9484500.

K. Gladkikh and A. A. Zakharov, Approach to forming vulnerability datasets for fine-tuning AI agents. 2025 International Russian Smart Industry Conference (SmartIndustryCon), 2025, pp. 771–776. doi: 10.1109/smartindustrycon65166.2025.10986048.

Z. Gao, H. Wang, Y. Zhou, W. Zhu, and C. Zhang, “How far have we gone in vulnerability detection using large language models,” arXiv.org, Nov. 21, 2023. https://arxiv.org/abs/2311.12420

Z. Sheng, Z. Chen, S. Gu, H. Huang, G. Gu, and J. Huang, “LLMs in software security: a survey of vulnerability detection techniques and insights,” ACM Computing Surveys, vol. 58, no. 5, pp. 1–35, Sep. 2025, doi: 10.1145/3769082.

T. Chen, Challenges and opportunities in integrating LLMs into continuous integration/continuous deployment (CI/CD) pipelines. 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), 2024, pp. 364–367. doi: 10.1109/ainit61980.2024.10581784.

W. Cheng, K. Sun, X. Zhang, and W. Wang, “Security attacks on LLM-based code completion tools,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 22, pp. 23669–23677, Apr. 2025, doi: 10.1609/aaai.v39i22.34537.

S. Jenko, N. Mündler, J. He, M. Vero, and M. Vechev, “Black-box adversarial attacks on LLM-based code completion,” arXiv (Cornell University), Aug. 2024, doi: 10.48550/arxiv.2408.02509.

D. Noever, “Can large language models find and fix vulnerable software?,” arXiv.org, Aug. 20, 2023. https://arxiv.org/abs/2308.10345

A. Lekssays, H. Mouhcine, K. Tran, T. Yu, and I. Khalil, “LLMxCPG: context-aware vulnerability detection through code property graph-guided large language models,” arXiv.org, Jul. 22, 2025. https://arxiv.org/abs/2507.16585

Z. Sun et al., “Ensembling large language models for code vulnerability detection: an empirical evaluation,” arXiv.org, Sep. 16, 2025. https://arxiv.org/abs/2509.12629

M. A. Hannan, R. Ni, C. Zhang, L. Jia, R. Mangal, and C. S. Pasareanu, “On the Difficulty of Selecting Few-Shot Examples for Effective LLM-based Vulnerability Detection,” arXiv.org, Oct. 2025, doi: 10.14722/last-x.2026.23025.

W. Charoenwet, K. Tantithamthavorn, P. Thongtanunam, H. Y. Lin, M. Jeong, and M. Wu, “AgenticSCR: an autonomous agentic secure code review for immature vulnerabilities detection,” arXiv.org, Jan. 27, 2026. https://arxiv.org/abs/2601.19138

A. Zibaeirad and M. Vieira, “VulnLLMEval: a framework for evaluating large language models in software vulnerability detection and patching,” arXiv (Cornell University), Sep. 2024, doi: 10.48550/arxiv.2409.10756.

Y. Li et al., “Everything you wanted to know about LLM-based vulnerability detection but were afraid to ask,” arXiv (Cornell University), Apr. 2025, doi: 10.48550/arxiv.2504.13474.

K. I. Gladkikh and A. A. Zakharov, Comparison of language models for source code vulnerability classification. 2025 International Russian Automation Conference (RusAutoCon), 2025, pp. 779–784. doi: 10.1109/rusautocon65989.2025.11177346.

A. Neyer, F. F. Wu, and K. Imhof, “Object-oriented programming for flexible software: example of a load flow,” IEEE Transactions on Power Systems, vol. 5, no. 3, pp. 689–696, Jan. 1990, doi: 10.1109/59.65895.

F. Pedregosa et al., “Scikit-learn: machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, Nov. 2011.

T. Wolf et al., “HuggingFace’s transformers: state-of-the-art natural language processing,” arXiv.org, Oct. 09, 2019. https://arxiv.org/abs/1910.03771

D. Guo et al., “DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,” Nature, vol. 645, no. 8081, pp. 633–638, Sep. 2025, doi: 10.1038/s41586-025-09422-z.

A. Yang et al., “Qwen3 technical report,” arXiv.org, May 14, 2025. https://arxiv.org/abs/2505.09388

Y. Liu et al., “VulDetectBench: evaluating the deep capability of vulnerability detection with large language models,” arXiv.org, Jun. 11, 2024. https://arxiv.org/abs/2406.07595

J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao, “FlashAttention-3: fast and accurate attention with asynchrony and low-precision,” arXiv.org, Jul. 11, 2024. https://arxiv.org/abs/2407.08608





ISSN: 2307-8162