New AVI system slashes AI prompt attacks by 82%, sets safety standard for generative models

Researchers have developed a new safety framework aimed at the ethical and secure deployment of generative artificial intelligence. The paper, titled “Innovative Guardrails for Generative AI: Designing an Intelligent Filter for Safe and Responsible LLM Deployment” and published in Applied Sciences, details the development and evaluation of the Aligned Validation Interface (AVI), a novel middleware designed to safeguard large language model (LLM) interactions through intelligent, bidirectional filtering.
The research, led by Olga Shvetsova, Danila Katalshov, and Sang-Kon Lee from Korea University of Technology and Education, proposes AVI as an LLM-agnostic API gateway that proactively monitors, validates, and modifies both user inputs and AI-generated outputs. The system’s core goal is to prevent prompt injection, reduce toxic content, detect personally identifiable information (PII), and improve factual accuracy - all while maintaining user experience and regulatory compliance.
Why do existing LLM safety measures fall short?
The explosive growth of LLMs such as GPT-4, Claude, LLaMA, and Grok has intensified concerns around model alignment, toxicity, and data privacy. While many commercial LLM providers have incorporated internal safety mechanisms like Reinforcement Learning from Human Feedback (RLHF) and Moderation APIs, these systems often exhibit limitations: they can over-censor, lack transparency, and are typically rigid or poorly suited to varying regulatory environments.
The study highlights that relying solely on these embedded safety features poses risks, especially in high-stakes sectors like finance, healthcare, and education. It also emphasizes that open-source and third-party models often lack robust built-in safeguards, leaving organizations exposed to liabilities. To fill this critical governance gap, AVI introduces a modular, external safety framework that delivers real-time validation and context-driven augmentation independent of the underlying LLM architecture.
How does AVI ensure safety and accuracy in AI responses?
AVI operates as a compound AI system composed of four key modules:
- Input Validation Module (IVM): Scans user queries for threats such as prompt injection, PII exposure, or prohibited content using rule-based filters, ML classifiers, and named entity recognition.
- Output Validation Module (OVM): Reviews LLM-generated responses for toxicity, bias, policy violations, or factual inconsistencies.
- Response Assessment Module (RAM): Evaluates factual accuracy and relevance using semantic similarity models and hallucination detection methods.
- Contextualization Engine (CE/RAG): Enhances responses by injecting trusted knowledge retrieved via retrieval-augmented generation (RAG), ensuring outputs are grounded in verified sources.
Together, these modules allow AVI to perform both negative filtering (blocking harmful or non-compliant content) and positive filtering (enhancing response quality through context). All decisions are managed through an orchestration layer governed by configurable policies. The “traffic light” system - red, yellow, green - guides whether to allow, modify, or block requests and responses.
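While the paper does not publish AVI's source code, the orchestration pattern it describes can be sketched in a few lines of Python; the class and function names below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    GREEN = "allow"    # pass through unchanged
    YELLOW = "modify"  # sanitize or augment before passing on
    RED = "block"      # refuse the request or response

@dataclass
class Assessment:
    verdict: Verdict
    reason: str = ""

# Hypothetical module interfaces; the study's components are not open-sourced here.
def input_validation(prompt: str) -> Assessment:
    """IVM: flag prompt injection, PII, or prohibited content in the user query."""
    if "ignore previous instructions" in prompt.lower():
        return Assessment(Verdict.RED, "possible prompt injection")
    return Assessment(Verdict.GREEN)

def output_validation(response: str) -> Assessment:
    """OVM: flag toxicity, bias, or policy violations in the model output."""
    return Assessment(Verdict.GREEN)

def response_assessment(prompt: str, response: str) -> Assessment:
    """RAM: rough factuality/relevance check (placeholder logic)."""
    return Assessment(Verdict.GREEN)

def contextualize(prompt: str, response: str) -> str:
    """CE/RAG: optionally enrich the response with retrieved, trusted context."""
    return response

def avi_gateway(prompt: str, call_llm) -> str:
    """LLM-agnostic gateway: validate the input, call any LLM, validate the output."""
    if input_validation(prompt).verdict is Verdict.RED:
        return "Request blocked by policy."
    response = call_llm(prompt)
    if output_validation(response).verdict is Verdict.RED:
        return "Response withheld by policy."
    if response_assessment(prompt, response).verdict is not Verdict.GREEN:
        response = contextualize(prompt, response)
    return response
```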
The study also models the entire decision workflow mathematically, introducing a framework of risk scoring, policy thresholds, and conditional actions. These processes are customizable via structured policy files and are designed to evolve with changing ethical norms and attack patterns.
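As an illustrative reconstruction (not the paper's exact equations), such a workflow can be written as a weighted risk score compared against policy thresholds:

```latex
% Illustrative reconstruction of a risk-scoring and thresholding scheme;
% the paper's exact notation is not reproduced here.
R(x) = \sum_{i=1}^{n} w_i \, r_i(x), \qquad r_i(x) \in [0,1], \quad \sum_{i=1}^{n} w_i = 1
\qquad
a(x) =
\begin{cases}
\text{allow (green)}   & R(x) < \tau_{\mathrm{yellow}} \\
\text{modify (yellow)} & \tau_{\mathrm{yellow}} \le R(x) < \tau_{\mathrm{red}} \\
\text{block (red)}     & R(x) \ge \tau_{\mathrm{red}}
\end{cases}
```

Here the r_i are per-module risk scores (for example, an injection-classifier probability from the IVM), the w_i are policy weights, and the two thresholds are the traffic-light cut-offs defined in the policy file; all symbols are assumptions made for illustration.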
How effective is the AVI system in practice?
The research team validated AVI through a proof-of-concept system built using open-source tools and evaluated against recognized benchmarks. The findings were compelling:
- Prompt Injection Mitigation: Using the AdvBench benchmark, AVI reduced attack success rates from 78% to 14%, an 82% relative reduction.
- Toxicity Reduction: With 150 prompts from the RealToxicityPrompts dataset, the system cut toxicity in LLM outputs by 75% and modified or blocked 65% of unsafe responses.
- PII Detection: On a synthetic dataset of 70 sentences, AVI achieved macro-averaged precision (~0.96), recall (~0.94), and F1-score (~0.95), confirming its strong data privacy safeguards.
- Hallucination Detection: A basic keyword-based mechanism flagged 75% of incorrect factual outputs in controlled tests (a simplified sketch of this style of check follows the list), setting the stage for more advanced semantic methods.
- Latency Impact: The validation-only mode introduced a modest 85 ms delay, while the RAG-enabled mode added about 450 ms, both within acceptable limits for most applications.
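To make the hallucination-detection result concrete, a keyword-coverage check of the kind described might look like the following sketch; the keyword list and threshold are hypothetical, not taken from the study.

```python
# Illustrative keyword-based hallucination flag; the actual rules used in the
# study are not reproduced here.
def flag_hallucination(response: str, expected_keywords: list[str],
                       min_coverage: float = 0.5) -> bool:
    """Flag a response when too few expected reference keywords appear in it.

    expected_keywords would come from a trusted reference answer or knowledge
    base entry for the prompt under test; min_coverage is a tunable threshold.
    """
    text = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    coverage = hits / len(expected_keywords) if expected_keywords else 1.0
    return coverage < min_coverage  # True means "likely hallucinated or off-topic"

# Example: a response about the Eiffel Tower that misses every reference fact.
print(flag_hallucination("The Eiffel Tower is in Berlin and opened in 1920.",
                         ["paris", "1889", "gustave eiffel"]))  # True
```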
In qualitative tests, AVI’s contextualization engine also enhanced ethical alignment. For instance, in response to a harmful prompt requesting content that promotes eating disorders, AVI not only rejected the request but added medically grounded information about health risks and offered resources for help, illustrating its dual role in blocking harmful content and providing constructive education.
What are the implications for AI governance and enterprise use?
The authors argue that AVI provides a “sovereign gateway” for organizations to control LLM behavior without depending entirely on black-box model alignments. Enterprises and public agencies can customize policies based on local laws (e.g., GDPR or the EU AI Act), internal ethics codes, or specific industry standards. By integrating domain-specific rules, private knowledge bases, and transparent audit logs, AVI promotes responsible AI adoption at scale.
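In practice, such a policy might be expressed as a structured configuration along these lines; the field names and values are purely hypothetical, as the paper's policy schema is not reproduced here.

```python
# Hypothetical policy configuration; the fields are illustrative assumptions,
# not the schema defined in the paper.
AVI_POLICY = {
    "jurisdiction": "EU",                        # e.g., GDPR / EU AI Act context
    "thresholds": {"yellow": 0.4, "red": 0.8},   # traffic-light risk cut-offs
    "pii": {
        "action": "redact",                      # redact detected PII rather than block
        "entities": ["EMAIL", "PHONE", "NATIONAL_ID"],
    },
    "toxicity": {"action": "block", "max_score": 0.2},
    "rag": {
        "enabled": True,
        "knowledge_base": "internal-compliance-kb",  # private, domain-specific source
    },
    "audit": {"log_decisions": True, "retention_days": 365},
}
```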
Furthermore, the framework’s explainability meets the growing market demand for accountable AI systems. Each step in AVI’s decision chain can be traced and justified, supporting regulatory reporting, user feedback, and internal compliance audits.
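Concretely, a traceable decision record could capture, for each request, which module fired, the risk score, and the action taken; the structure below is an assumed example rather than AVI's actual log format.

```python
import json
import time

# Hypothetical audit record; the fields are illustrative assumptions about
# what a traceable AVI decision log could contain.
def audit_record(request_id: str, module: str, risk_score: float,
                 action: str, reason: str) -> str:
    return json.dumps({
        "request_id": request_id,
        "timestamp": time.time(),
        "module": module,          # e.g., "IVM", "OVM", "RAM", "CE"
        "risk_score": risk_score,  # aggregated or per-module score
        "action": action,          # "allow" | "modify" | "block"
        "reason": reason,          # human-readable justification for review
    })

print(audit_record("req-42", "IVM", 0.91, "block", "suspected prompt injection"))
```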
Notably, the study acknowledges a few limitations. The evaluation relied on synthetic and narrow test sets. Hallucination detection, in particular, remains primitive and needs more robust semantic techniques. The PII detection system also requires broader multilingual and real-world data validation. Finally, the entire system depends on continuous human oversight for policy updates and threat monitoring, which could be resource-intensive for smaller organizations.
FIRST PUBLISHED IN: Devdiscourse