AI agents represent your brand in real time. Mistakes are not just awkward; they can be costly. A single hallucinated response, policy violation, or off-topic detour can erode customer trust or lead to significant financial, legal, or reputational damage. Recent public failures in generative AI have shown that a cool demo or fast deployment doesn’t always translate to production readiness.
With AI usage expanding across support and operations, the risks are escalating. According to Gartner (2024), service organizations will face a 300% increase in fraud attempts by 2027 compared to 2023.
Enterprises need adaptive, robust guardrails that address a wide and evolving threat landscape, including:
- Adversarial attacks: Prompt injection, jailbreaking, and other tactics that exploit system behavior to bypass guidelines or access unauthorized data.
- Sensitive or restricted topics: Handling of regulated claims (e.g., APR disclosures, health advice, legal opinions) or inappropriate content such as competitor comparisons or geopolitical issues.
- Organizational red lines: Domain-specific rules and banned phrases, such as “never mention [X],” which require precise, enforceable filters.
- Hallucinations: Fabricated or ungrounded responses that sound convincing but have no basis in source truth.
As data, regulations, and attack methods evolve, static tuning quickly becomes outdated. To remain safe and effective, AI agents need ongoing monitoring and continuously evolving guardrails that adapt faster than any one-time adjustment. Because a base LLM cannot practically be fine-tuned to every business’s specific policies, workflows, and risk tolerances, its built-in safeguards often miss certain types of attacks. This is why additional, context-specific guardrails are essential for comprehensive protection.
This technical deep dive unpacks Cresta’s three strategic pillars of AI agent defense: how they work, how each integrates, and why this approach delivers safer, faster, and more resilient AI agents.
Note: The guardrail examples shared in this blog are illustrative and do not reflect the actual guardrail prompts used in production. For security and safety reasons, we do not expose real prompts. These examples are simplified to demonstrate concepts, not the complexity or robustness of our live systems.
Three Strategic Pillars of AI Agent Defense
Building safe, scalable AI agents requires more than clever prompts or after-the-fact moderation. At Cresta, we’ve developed an architecture for regulated, brand-sensitive, high-volume environments across three pillars:
- System-level guardrails: Integrated into the AI agent’s system prompt and orchestration code, these instructions define the boundaries of the AI agent’s operation by outlining what is considered in-scope versus out-of-scope, and how to respond to detours or policy violations.
- Supervisory guardrails: These are classifiers that run in parallel with the AI agent to detect malicious or risky user inputs, such as manipulative redirection attempts, code generation requests, or queries designed to gain access to the system prompt.
- LLM-driven adversarial testing: We develop new strategies and attack vectors using advanced reasoning models to continuously test for weaknesses and evolve our guardrails alongside these attack strategies.
This strategy proactively addresses risks in real time, intercepts threats before they can cause harm, and continually probes for vulnerabilities, all while monitoring live performance at scale without sacrificing latency or user experience.
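To make the composition concrete, here is a minimal sketch of how the first two layers could fit together in a single conversational turn. Every name in it (the handle_turn function, the Verdict type, the fallback wording) is an illustrative assumption rather than Cresta’s actual code; the third pillar exercises a pipeline like this offline with adversarial traffic.

```python
# Illustrative sketch only; all names are placeholders, not Cresta's production interfaces.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

# A supervisory guardrail is modeled as any callable that inspects the user
# message plus conversation context and returns a Verdict.
SupervisoryGuardrail = Callable[[str, dict], Verdict]

def handle_turn(
    user_message: str,
    context: dict,
    agent_respond: Callable[[str, dict], str],  # agent already constrained by its system prompt and scoped tools (Pillar 1)
    guardrails: List[SupervisoryGuardrail],     # classifiers running alongside the agent (Pillar 2)
) -> str:
    for guardrail in guardrails:
        verdict = guardrail(user_message, context)
        if not verdict.allowed:
            # Drop the planned response and fall back to a safe message or escalation.
            return ("I'm not able to help with that request here, "
                    "but I can connect you with a specialist.")
    return agent_respond(user_message, context)
```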

Pillar 1: System-Level Guardrails – Blocking Risk at the Source
System-level guardrails are rules enforced in system prompts and orchestration code. They keep AI agents aligned with enterprise policies from the start and prevent high-stakes risks such as large-scale fraud, exposure of protected data, or disruption of essential services.
They form the first barrier against unsafe behavior by:
- Enforcing coded rules and compliant responses for prohibited requests
- Applying policy-compliant deterministic flows for sensitive actions such as authentication and transactions
- Limiting the AI agent’s tools and data access to approved scopes
- Routing flagged interactions to human agents with relevant context for faster handling
Example policies across industries:
- Banking: Never present account numbers or credentials in responses; instead, direct customers to secure, authenticated channels such as verified portals or apps for sensitive actions.
- Healthcare: Do not provide medical diagnoses or treatment recommendations; share only approved educational resources or escalate to licensed clinicians.
- Retail: Always check live inventory and policy tables before confirming availability, pricing, or returns; never guess.
These policies are reinforced by deterministic code that limits data access and provides real-time information to the AI agent.
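As a rough illustration of the pattern (and, per the note above, not a real production prompt), a system-level guardrail pairs policy text in the system prompt with deterministic code that gates tool and data access. The tool names and policy wording below are invented for this example:

```python
# Hypothetical banking-flavored example: policy text in the system prompt plus a
# deterministic tool allowlist enforced in orchestration code, outside the model's control.

SYSTEM_PROMPT = """
You are a customer support agent for a retail bank.
- Never display account numbers, passwords, or one-time codes.
- For sensitive actions (transfers, signatory changes), direct the customer to the
  authenticated mobile app or hand off to a human agent with conversation context.
- If a request is out of scope, say so briefly and offer the supported options.
"""

# Stub tool implementations standing in for real backend integrations.
TOOL_REGISTRY = {
    "branch_hours_lookup": lambda **kwargs: {"hours": "Mon-Fri 9am-5pm"},
    "card_replacement_status": lambda **kwargs: {"status": "shipped"},
    "escalate_to_human": lambda **kwargs: {"queued": True},
}

# The agent may only call tools approved for its use case, regardless of what it generates.
APPROVED_TOOLS = {"branch_hours_lookup", "card_replacement_status", "escalate_to_human"}

def execute_tool_call(tool_name: str, arguments: dict):
    if tool_name not in APPROVED_TOOLS:
        raise PermissionError(f"Tool '{tool_name}' is outside the approved scope")
    return TOOL_REGISTRY[tool_name](**arguments)
```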
Pillar 2: Supervisory Guardrails – Context-Aware, Real-Time Protection
Supervisory guardrails are classifier models that monitor interactions in parallel with the AI agent, detecting user inputs that violate policies and preventing the agent from acting on them. If a violation is detected, the AI agent stops its planned output (even mid-response) and responds with a safe fallback message, escalates the issue, or ends the conversation.
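One way to realize the “stop even mid-response” behavior is to stream the agent’s output while the classifier verdict is computed concurrently, and to cut the stream as soon as a violation is flagged. The asyncio sketch below is an assumed design with toy stand-ins for both the classifier and the agent, not Cresta’s implementation:

```python
import asyncio

async def classify_input(user_message: str) -> bool:
    """Toy supervisory classifier: True means the input violates policy."""
    await asyncio.sleep(0.05)  # stands in for a fast classifier model call
    return "another patient" in user_message.lower()

async def stream_agent_tokens(user_message: str):
    """Toy stand-in for the AI agent's streamed response."""
    for token in ["Sure,", " let", " me", " pull", " that", " up..."]:
        await asyncio.sleep(0.02)
        yield token

SAFE_FALLBACK = "I can't share that information, but I'm happy to help with your own account."

async def guarded_response(user_message: str) -> str:
    verdict_task = asyncio.create_task(classify_input(user_message))
    emitted = []
    async for token in stream_agent_tokens(user_message):
        if verdict_task.done() and verdict_task.result():
            # Violation detected mid-response: abandon the planned output.
            return SAFE_FALLBACK
        emitted.append(token)
    # Final check before committing the full response.
    return SAFE_FALLBACK if await verdict_task else "".join(emitted)

# Example: asyncio.run(guarded_response("Can you send me another patient's records?"))
```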

Examples:
- Healthcare: A patient scheduling an appointment asks for another patient’s medical records. A supervisory guardrail detects this privacy violation in the context of a legitimate scheduling request, blocks the request, and routes the AI agent to respond with a safe fallback explaining that medical records cannot be shared without proper authorization.
- Banking: A corporate client requests immediate changes to authorized signatories without verification. The fraud and Know Your Customer (KYC) classifiers flag the request as non-compliant, and the AI agent halts the action and informs the client that the request must go through secure manual approval to prevent fraud.
- Insurance: While updating contact details, a customer asks for the claim status of another policyholder. The guardrail intercepts this request as a privacy breach and directs the AI agent to provide guidance on how the authorized policyholder can obtain the information.
By combining multiple classifier guardrails that each target a specific risk, such as prohibited discount checks and sensitive topic detection, Cresta enforces policies precisely and in real time without slowing the end-user experience. Organizations can extend these checks with domain-specific rules, so enforcement adapts to each conversation.
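Here is one way such a registry of narrowly scoped guardrails might look, with each check mapped to its own action; the names and keyword-based detect functions are toy stand-ins for real classifier models:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Guardrail:
    name: str
    detect: Callable[[str], bool]  # in production, a dedicated classifier model
    action: str                    # e.g. "fallback", "escalate", "end_conversation"

GUARDRAILS = [
    Guardrail("prohibited_discount", lambda msg: "extra discount" in msg.lower(), "fallback"),
    Guardrail("sensitive_topic", lambda msg: "competitor" in msg.lower(), "fallback"),
    Guardrail("privacy_request", lambda msg: "another patient" in msg.lower(), "escalate"),
    # Organizations can append their own domain-specific entries here.
]

def evaluate_guardrails(user_message: str) -> Optional[Tuple[str, str]]:
    """Returns (guardrail name, action) for the first triggered check, or None if all pass."""
    for guardrail in GUARDRAILS:
        if guardrail.detect(user_message):
            return guardrail.name, guardrail.action
    return None
```

In practice, each check would run as its own classifier evaluated concurrently, keeping latency flat as more guardrails are added.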
Pillar 3: LLM-Driven Adversarial Testing – Scalable, Rapid Response to Emerging Threats

Guardrails must adapt as attack methods evolve. To strengthen guardrails and stay ahead of new threats, Cresta employs an iterative adversarial testing loop:
- LLM-powered attacker models continuously generate multi-step prompts intended to bypass safeguards, uncovering weaknesses that human testers might not anticipate.
- An independent review process evaluates whether the guardrails successfully blocked each attempt, avoiding bias from the attacker model and strengthening confidence in defenses.
- Missed detections are added to an attack library built from research, public datasets, and historical attack data.
- Reasoning models build on this library to develop new attack strategies for further testing, ensuring rapid iteration and hardening of guardrails.
By using an LLM with minimal restrictions as the attacker, Cresta can explore a wide range of bypass strategies that would be impractical to develop manually. Human researchers complement this process by uncovering corner cases and validating improvements, so Cresta’s overall security posture evolves holistically. This combination of scalable LLM-driven testing and human expertise keeps defenses aligned with an evolving threat landscape.
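A compressed sketch of that loop is shown below; generate_attack, run_agent_with_guardrails, and judge_blocked are hypothetical stand-ins (passed in as callables) for the attacker model, the guarded agent under test, and the independent reviewer:

```python
# Illustrative adversarial-testing round; every callable passed in is a stand-in,
# not a real Cresta or vendor API.

def adversarial_test_round(
    generate_attack,            # attacker LLM: proposes a multi-step attack, conditioned on prior examples
    run_agent_with_guardrails,  # executes the attack turns against the guarded AI agent
    judge_blocked,              # independent reviewer model: did the guardrails hold?
    attack_library: list,       # seeded from research, public datasets, and historical attacks
    num_attempts: int = 50,
) -> list:
    missed = []
    for _ in range(num_attempts):
        attack = generate_attack(seed_examples=attack_library[-20:])
        transcript = run_agent_with_guardrails(attack["turns"])
        # A separate judge avoids letting the attacker model grade its own work.
        if not judge_blocked(attack, transcript):
            missed.append({"attack": attack, "transcript": transcript})
    # Missed detections extend the library used to harden guardrails and to
    # condition the next round of attack generation.
    attack_library.extend(missed)
    return missed
```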
The Adaptive Guardrail Flywheel: An Evolving AI Agent Safety System
Our three pillars work together in a closed-loop system where live customer interactions and synthetic adversarial tests continuously feed insights back into the guardrail design. Real-world data identifies emerging risks, while targeted adversarial probes stress-test how well each AI agent is protected. These probes are industry-specific to reflect real-world risk, not general-purpose prompts. For example:
- In banking, an AI agent might be tested on how it responds to unauthorized transfer requests or attempts to bypass verification protocols, in alignment with FINRA guidelines.
- In healthcare, it could be challenged with HIPAA-sensitive scenarios, such as attempts to extract protected health information.
- In retail, probes may simulate refund abuse, misapplied promotions, or misleading claims about product availability.
Each detected gap is triaged based on potential impact, added to an adversarial library, and used to refine both system-level and supervisory guardrails. Updates undergo rigorous offline validation to ensure they do not introduce significant false rejections (e.g., more than 0.5% of user requests) or performance degradation. Once evaluated and approved, changes are deployed back into production, ensuring that improvements reach live systems quickly and safely.
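The false-rejection check, for example, can be expressed as a simple release gate over a benign regression set. The 0.5% budget matches the figure above, while the function and interface names are illustrative assumptions:

```python
FALSE_REJECTION_BUDGET = 0.005  # block a candidate update if it rejects more than 0.5% of benign requests

def passes_release_gate(candidate_guardrails, benign_requests) -> bool:
    """Runs a candidate guardrail set against known-benign traffic before deployment."""
    rejected = sum(
        1 for request in benign_requests
        if candidate_guardrails.evaluate(request).blocked  # assumed evaluate() -> result with a .blocked flag
    )
    false_rejection_rate = rejected / max(len(benign_requests), 1)
    return false_rejection_rate <= FALSE_REJECTION_BUDGET
```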
This cycle makes Cresta’s guardrails progressively more resilient, enabling them to keep pace with shifting user behaviors and evolving threats while remaining tailored to each organization.
Ready to see how real-time guardrails can protect your brand?
Book a personalized demo and discuss what “enterprise-ready” means for your organization.
Cresta is hiring! If you're passionate about building safe, enterprise-grade AI systems, we'd love to hear from you. Explore opportunities to join our team of engineers and product leaders shaping the future of AI.