Adversarial testing of large language models

Yuriy Lebedinskiy; Dmitry Namiot

Adversarial testing of large language models

Yuriy Lebedinskiy, Dmitry Namiot

Abstract

Recently, artificial intelligence has made a huge leap in development, transforming almost all areas of human activity. The emergence of large language models (LLM) has opened up new opportunities for process automation, content generation and natural language processing. These technologies are already actively used in medicine, education, business and the everyday life of millions of people. However, the rapid development of technologies brings not only new opportunities, but also serious security risks. One of the most pressing threats in this area has become the use of prompt injections - a special type of attack aimed at manipulating the behavior of language models. This phenomenon demonstrates the vulnerability of even the most advanced AI systems to attempts to bypass their defense mechanisms. This work is devoted to the development of a framework for assessing the resilience of systems using large language models to prompt injections. The work includes a classification of various tactics used by attackers when creating malicious prompts. The developed testing system allows you to automate the process of checking systems using large language models for vulnerabilities to prompt injections. The implementation of the framework enables developers and researchers to evaluate and improve the defense mechanisms of language models against various types of prompt attacks.

Full Text:

PDF (Russian)

References

Mudarova, Ramina, and Dmitry Namiot. "Countering Prompt Injection attacks on large language models." International Journal of Open Information Technologies 12.5 (2024): 39-48.

Namiot, Dmitry, and Elena Zubareva. "About AI Red Team." International Journal of Open Information Technologies 11.10 (2023): 130-139.

Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models / L. Lin, H. Mu, Z. Zhai, M. Wang, Y. Wang, R. Wang, J. Gao, Y. Zhang, W. Che, T. Baldwin, X. Han, H. Li. — 2024. — Март. — DOI: 10.48550/ARXIV. 2404.00629. — eprint: 2404.00629 (cs.CL).

Raheja T., Pochhi N. Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations. — 2024. — Окт. — DOI: 10.48550 /ARXIV.2410.09097. — eprint: 2410.09097 (cs.CL).

Automated Progressive Red Teaming / B. Jiang, Y. Jing, T. Shen, T. Wu, Q. Yang, D. Xiong. — 2024. — Июль. — DOI: 10.48550/ARXIV.2407.03876. — eprint: 2407.03876 (cs.CR)

OWASP Foundation. OWASP Top 10 for LLM Applications. — 2024. https://owasp.org/www-project-top-10-for-large-language-model-applications/.

Adversarial machine learning: a taxonomy and terminology of attacks and mitigations / A. Vassilev, A. Oprea, A. Fordyce, H. Anderson. — 2024. DOI: 10.6028/nist.ai.100-2e2023.

WIPI: A New Web Threat for LLM-Driven Web Agents / F. Wu, S. Wu, Y. Cao, C. Xiao. — 2024. DOI: 10.48550/ARXIV.2402.16965. — eprint: 2402.16965 (cs.CR).

Li, Yuanchun, et al. "Personal llm agents: Insights and survey about the capability, efficiency and security." arXiv preprint arXiv:2401.05459 (2024).

Zou, Wei, et al. "Poisonedrag: Knowledge corruption attacks to retrieval-augmented generation of large language models." arXiv preprint arXiv:2402.07867 (2024).

Deng, Zehang, et al. "Ai agents under threat: A survey of key security challenges and future pathways." ACM Computing Surveys 57.7 (2025): 1-36.

Wu, Fangzhou, Ethan Cecchetti, and Chaowei Xiao. "System-Level Defense against Indirect Prompt Injection Attacks: An Information Flow Control Perspective." arXiv preprint arXiv:2409.19091 (2024).

Ha, Junwoo, et al. "One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs." arXiv preprint arXiv:2503.04856 (2025).

Pfister, Niklas, et al. "Gandalf the red: Adaptive security for llms." arXiv preprint arXiv:2501.07927 (2025).

Schulhoff, Sander, et al. "Ignore this title and hackaprompt: Exposing systemic vulnerabilities of llms through a global scale prompt hacking competition." Association for Computational Linguistics (ACL), 2023.

Wan, Shengye, et al. "Cyberseceval 3: Advancing the evaluation of cybersecurity risks and capabilities in large language models." arXiv preprint arXiv:2408.01605 (2024).

Liu, Yi, et al. "Jailbreaking chatgpt via prompt engineering: An empirical study." arXiv preprint arXiv:2305.13860 (2023).

Rossi, Sippo, et al. "An early categorization of prompt injection attacks on large language models." arXiv preprint arXiv:2402.00898 (2024).

Huang, Ruixuan, et al. "GuidedBench: Equipping Jailbreak Evaluation with Guidelines." arXiv preprint arXiv:2502.16903 (2025).

Jin, Haibo, et al. "Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models." arXiv preprint arXiv:2407.01599 (2024).

Yi, Sibo, et al. "Jailbreak attacks and defenses against large language models: A survey." arXiv preprint arXiv:2407.04295 (2024).

Shen, Xinyue, et al. "" do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models." Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 2024.

Sitawarin, Chawin, et al. "Pal: Proxy-guided black-box attack on large language models." arXiv preprint arXiv:2402.09674 (2024).

Zeng, Yi, et al. "How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

Zou, Andy, et al. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023).

Liu, Xiaogeng, et al. "Autodan: Generating stealthy jailbreak prompts on aligned large language models." arXiv preprint arXiv:2310.04451 (2023).

Zhou, Yuqi, et al. "Virtual context: Enhancing jailbreak attacks with special token injection." arXiv preprint arXiv:2406.19845 (2024).

Huang, Brian RY. "Plentiful Jailbreaks with String Compositions." arXiv preprint arXiv:2411.01084 (2024).

Daniel, Johan S., and Anand Pal. "Impact of non-standard unicode characters on security and comprehension in large language models." arXiv preprint arXiv:2405.14490 (2024).

Jiang, Fengqing, et al. "Artprompt: Ascii art-based jailbreak attacks against aligned llms." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

Ghanim, Mansour Al, et al. "Jailbreaking llms with arabic transliteration and arabizi." arXiv preprint arXiv:2406.18725 (2024).

Zhang, Tianrong, et al. "Wordgame: Efficient & effective llm jailbreak via simultaneous obfuscation in query and response." arXiv preprint arXiv:2405.14023 (2024).

Deng, Yue, et al. "Multilingual jailbreak challenges in large language models." arXiv preprint arXiv:2310.06474 (2023).

Wei, Alexander, Nika Haghtalab, and Jacob Steinhardt. "Jailbroken: How does llm safety training fail?." Advances in Neural Information Processing Systems 36 (2023): 80079-80110.

Wei, Zhipeng, Yuqi Liu, and N. Benjamin Erichson. "Emoji attack: Enhancing jailbreak attacks against judge llm detection." Forty-second International Conference on Machine Learning. 2025.

Li, Zhecheng, et al. "Vulnerability of LLMs to Vertically Aligned Text Manipulations." arXiv preprint arXiv:2410.20016 (2024).

Andriushchenko, Maksym, and Nicolas Flammarion. "Does Refusal Training in LLMs Generalize to the Past Tense?." arXiv preprint arXiv:2407.11969 (2024).

Ren, Qibing, et al. "Codeattack: Revealing safety generalization challenges of large language models via code completion." arXiv preprint arXiv:2403.07865 (2024).

Wu, Zihui, et al. "The dark side of function calling: Pathways to jailbreaking large language models." arXiv preprint arXiv:2407.17915 (2024).

Zheng, Xiaosen, et al. "Improved few-shot jailbreaking can circumvent aligned language models and their defenses." Advances in Neural Information Processing Systems 37 (2024): 32856-32887.

Anil, Cem, et al. "Many-shot jailbreaking." Advances in Neural Information Processing Systems 37 (2024): 129696-129742.

Liu, Yi, et al. "Jailbreaking chatgpt via prompt engineering: An empirical study." arXiv preprint arXiv:2305.13860 (2023).

Sun, Lichao, et al. "Trustllm: Trustworthiness in large language models." arXiv preprint arXiv:2401.05561 3 (2024).

Jiang, Liwei, et al. "Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models." Advances in Neural Information Processing Systems 37 (2024): 47094-47165.

Zhou, Weikang, et al. "Easyjailbreak: A unified framework for jailbreaking large language models." arXiv preprint arXiv:2403.12171 (2024).

Chao, Patrick, et al. "Jailbreakbench: An open robustness benchmark for jailbreaking large language models." arXiv preprint arXiv:2404.01318 (2024).

Derczynski, Leon, et al. "garak: A framework for security probing large language models." arXiv preprint arXiv:2406.11036 (2024).

Munoz, Gary D. Lopez, et al. "PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System." arXiv preprint arXiv:2410.02828 (2024).

Souly, Alexandra, et al. "A strongreject for empty jailbreaks." arXiv preprint arXiv:2402.10260 (2024).

Neronov R., Nizamov T., Ivanov N. LLAMATOR: Framework for testing vulnerabilities of large language models (LLM) https://github.com/LLAMATOR-Core/llamator.

Han, Seungju, et al. "Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms." arXiv preprint arXiv:2406.18495 (2024).

Yandex. YandexGPT 5 Pro. — 2024. — Version 5 Pro.

Google. Gemini 2.5 Flash. — 2024. — Version 2.5 Flash.

OpenAI. GPT-4. — 2024. — Version 1.

Namiot, Dmitry, Eugene Ilyushin, and Ivan Chizhov. "Artificial intelligence and cybersecurity." International Journal of Open Information Technologies 10.9 (2022): 135-147.

Ilyushin, Eugene, Dmitry Namiot, and Ivan Chizhov. "Attacks on machine learning systems-common problems and methods." International Journal of Open Information Technologies 10.3 (2022): 17-22.

Refbacks

There are currently no refbacks.

Abava Кибербезопасность ИТ конгресс СНЭ

ISSN: 2307-8162