Popular open-source AI models easily breached by prompt injection attacks

CO-EDP, VisionRI | Updated: 23-05-2025 23:08 IST | Created: 23-05-2025 23:08 IST

A comprehensive new study has revealed that most widely used open-source large language models (LLMs) are dangerously susceptible to prompt injection attacks, raising serious questions about the security and reliability of generative AI tools. The study, titled “Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs” and published on arXiv by Jiawen Wang, Pritha Gupta, Ivan Habernal, and Eyke Hüllermeier, systematically evaluates the fragility of 14 major open-source LLMs using three attack types, including a newly proposed one, across five benchmark datasets.

By introducing a new evaluation metric, the Attack Success Probability (ASP), which accounts for uncertain model responses in addition to outright successful ones, the research paints a far more nuanced, and troubling, picture of model robustness. Findings show that even moderately sophisticated prompt manipulation techniques can coerce many leading open-source LLMs into generating harmful, misleading, or illegal content with alarmingly high reliability.

How effective are prompt injection attacks on popular LLMs?

The study evaluated three distinct types of prompt injection attacks: the widely known "ignore prefix" method, a role-playing chain-of-thought (CoT) attack, and a newly proposed "hypnotism attack." These were tested against 14 prominent open-source LLMs, including StableLM2, Mistral, Openchat, Llama2, and Gemma variants, across five datasets: JailbreakBench, AdvBench, HarmBench, WalledEval, and SAP10.

The "ignore prefix" attack, which simply instructs the model to disregard its original prompt and instead execute an innocuous command, demonstrated the highest effectiveness across the board, achieving ASP scores as high as 0.973 on StableLM2 and 0.931 on Mistral. Meanwhile, the hypnotism attack, framed as a psychological suggestion embedded in a benign task, achieved an ASP of up to 0.994 on Neural-chat and 0.985 on StableLM2.

Notably, these attacks did not require internal access to the models or any specialized computing resources, underscoring the real-world feasibility and severity of the threat. The ASP metric also enabled the researchers to quantify ambiguous outputs, in which models hesitated but did not outright reject malicious instructions, responses that previous metrics would have overlooked.
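
Conceptually, a metric of this kind can be computed by labelling each response as a success, a hesitation, or a refusal and giving hesitations partial credit. The sketch below illustrates the idea; the labels and the 0.5 hesitation weight are assumptions made for illustration, since the article does not reproduce the paper's exact formula.

```python
# Illustrative ASP-style scoring: full credit for clearly successful attacks,
# partial credit for uncertain ("hesitant") responses, none for refusals.
# The 0.5 hesitation weight is an assumption; the paper defines its own
# treatment of uncertain responses.

def attack_success_probability(labels, hesitation_weight=0.5):
    """labels: iterable of 'success', 'hesitation', or 'refusal', one per response."""
    labels = list(labels)
    if not labels:
        return 0.0
    score = sum(
        1.0 if lab == "success" else hesitation_weight if lab == "hesitation" else 0.0
        for lab in labels
    )
    return score / len(labels)


# Example: 7 successes, 2 hesitations, 1 refusal across 10 attack prompts.
print(attack_success_probability(["success"] * 7 + ["hesitation"] * 2 + ["refusal"]))
# -> 0.8
```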

Which models are most at risk and why?

The analysis showed that moderately well-known LLMs such as StableLM2, Openchat, and Neural-chat are significantly more vulnerable than their state-of-the-art counterparts. These models achieved ASP scores nearing or at 100% across several datasets and attack types, suggesting a systemic weakness in safety mechanisms.

Conversely, better-known models like Llama2, Llama3, and Gemma-2b displayed far greater resistance. For example, the Gemma-2b model consistently scored ASP values near 0%, refusing to generate harmful content across all benchmarks. However, the study also identified that even these more robust models are not immune to attack under specific conditions or dataset categories.

Vulnerability also varied based on the type of dataset used. For instance, SAP10, a multi-categorical dataset, exposed significant weaknesses in many models, especially in politically and religiously charged prompts. Hypnotism attacks in these categories achieved ASP scores over 70%, likely due to lower internal harmfulness thresholds in LLMs when processing such topics.

What do these vulnerabilities mean for AI safety?

The findings have serious implications for the broader AI ecosystem, especially given the increasing reliance on open-source LLMs for everything from customer support to educational content and code generation. The study shows that even relatively simple prompt manipulations can result in models generating responses that instruct users on illegal hacking, spreading malware, or inciting violence. One documented successful attack example includes a full step-by-step guide for orchestrating a large-scale terrorist attack, generated by Neural-chat during the SAP10 benchmark.

Beyond direct harm, these vulnerabilities undermine public trust in AI and present ethical and legal challenges for developers, deployers, and regulators. Since these attacks can occur at inference time, without altering the model or accessing training data, they evade many current defensive strategies focused on model alignment and training-time interventions.

Moreover, the study warns that ASP scores alone do not capture the full scale of the issue. Uncertain responses, categorized by the ASP framework as “hesitations”, reveal a troubling gray zone where models neither fully comply nor explicitly reject harmful prompts. These ambiguous responses, often containing partial or misleading guidance, could be just as damaging in real-world applications.

The study advocates for more widespread awareness within the AI community regarding the risks of inference-time attacks and the establishment of standardized safety benchmarks, like the ASP metric. It also highlights the importance of human-in-the-loop approaches and the need to prioritize ethical safeguards alongside technical robustness.

  • FIRST PUBLISHED IN: Devdiscourse