Last December, a major retailer launched a new AI customer service agent. Within days, bad actors had manipulated it into discussing topics that had nothing to do with retail, but had incredible potential for brand damage. The vendor’s explanation: the agent’s guardrails had been ‘inadvertently misconfigured’. The fix came quickly.
But the implications lingered: a well-resourced AI deployment, built on a prominent AI agent platform, had gone live without any systematic way to verify that its behavioral constraints were actually working. The misconfiguration wasn’t caught in testing or QA, and the brand found out at the same time the internet did.
While some of these stories of an AI agent failing make for memorable headlines, they’re also outliers. The more common, but still damaging, reality is quieter: AI agents that confidently give a customer the wrong return window, or one that misses a compliance requirement in one or two conversations. These failures rarely make the news, but they do show up in customer satisfaction (CSAT) scores, in escalation rates, and in customer churn.
Enterprises are deploying AI agents faster than they can effectively validate them, and the gap isn't closing. Historically, building a rigorous testing program has required so much manual effort, human judgment, and ongoing maintenance that most teams are forced to triage: cover the obvious scenarios, hope for the best, and patch problems as they surface in production. Others try to shortcut the work with unvalidated LLM judges, ending up with testing programs that look like they're working but don't actually verify the agent is ready for production.
The uncomfortable truth about AI agent testing is that many of these programs are built for launch day, rather than the realities of day-to-day production. They may be thorough enough to clear internal review, but in many cases, erode from there.
Where AI Agent Testing Breaks Down
Why is testing, such a critical part of deploying AI agents, so difficult to get right at scale?
Each step of this process is extensive, requiring time and care. A team must define what “great” looks like in precise, verifiable terms, configure evaluators that cut through the noise and catch true failures, and build test coverage that reflects how customers actually behave. Every model update, prompt revision, or new workflow restarts the cycle, straining how far resources can realistically stretch.
This results in a program that may appear externally comprehensive, but is internally riddled with holes.
That’s what today’s enhancements to Cresta AI Agent’s Automated Testing suite are built to solve, with four new capabilities, each targeting a place where testing programs have historically struggled to scale or hold up over time.
Requirements Adherence: AI Builds the Foundation, Experts Refine It
Requirements have typically lived in PRDs, brand guidelines, or compliance checklists. What they rarely produce is something an LLM evaluator can actually grade against.
AI Agent Testing 2.0 changes that by generating the initial requirements list directly from PRDs, CSVs, or existing documentation, accelerating a process that would otherwise take days, or even weeks for complex use cases. Cresta runs a one-click diagnosis that flags possible issues: unverifiable language, multi-parameter criteria the LLM can’t grade reliably, and requirements that depend on context the agent doesn’t have access to. Domain experts then review, refine, and approve.
Each approved requirement converts directly into an LLM evaluator: one source of truth running against both test conversations and production traffic.
The quality of the evaluators depends on the quality of the requirements themselves. Every requirement is binary: the agent either did something or it didn’t. A true requirement has a clear pass condition, a clear fail condition, and is defensible in an audit:
- The agent must verify the caller’s identity before disclosing account details.
- The agent must offer a specialist transfer if the user raises a billing dispute.
- The agent must refrain from providing specific financial advice.
- The agent must include a privacy disclosure if the conversation touched on personal health data.
Each requirement is also tagged as must-have or nice-to-have, a distinction that changes how organizations prioritize failures. A 98% pass rate looks strong until you realize the 2% of failures all hit must-have requirements. Separating must-have blockers from nice-to-have improvements means the testing program surfaces what teams actually need to act on, not just what happened.
Evaluators You Can Trust and Measure
Many testing programs configure LLM-as-judge evaluators once and trust – or hope – that they’ll continue to work effectively. Some rely on out-of-the-box evaluators that are intended to measure things like empathy or accuracy across a number of use cases, but generic evaluators miss the domain-specific nuance that determines whether an answer is actually right for a given business. Others build their own, but miss systematic feedback on whether the evaluator is too strict, too lenient, or just inconsistent. An evaluator that misfires doesn’t just produce noise; it actively misleads. False positives waste time chasing problems that aren’t real. False negatives let actual failures through unnoticed.
Cresta's approach starts with a hybrid foundation: combining LLM evaluators for nuanced, context-dependent judgments with deterministic evaluators for criteria that can be checked precisely. Evaluator Calibration then gives every requirement-based evaluator a measurable accuracy score. Teams select a group of requirements to calibrate and a source of conversations, and then review how the evaluator labeled a sample and provide feedback where the AI’s judgment missed the mark. Cresta returns an F1 score per requirement, a single metric that balances whether the evaluator is catching what it should without raising too many false alarms, so teams know exactly which evaluators are reliable and which need to be further refined.
At one retail client, evaluator calibration cut the false-alarm failure rate by more than half. The team stopped investigating noise and started investigating actual errors.
This process is not just diagnostic, but also makes iteration measurable. After refining a requirement, teams can re-run against the same labeled conversation set and see whether accuracy improved, making calibration a living feedback loop, rather than a one-time configuration.
Coverage That Reflects How Customers Actually Behave
Building test cases by hand often means that coverage reflects whatever the team had the bandwidth and resources to produce, not necessarily how customers actually behave. Most of the long tail of real customer behavior - the edge cases, the unusual phrasings, the unexpected paths - goes untested.
AI Agent Testing 2.0 automatically generates test cases from four sources, each grounded in real signal:
- Requirements violations: When a closed production conversation surfaces a requirements gap, AI reverse-engineers a Synthetic Customer (a simulated persona, based on customer conversation data) aligned to the scenario, and an evaluator to thoroughly validate the fix. Teams can immediately run the new test case to confirm the fix holds before the agent moves forward.
- Feedback on production conversations: When a reviewer flags a conversation, Cresta can auto-generate a complete test case from that feedback: name, description, expected behavior, and an evaluator aligned to the specific criteria. Every test case can be traced back to its source feedback, allowing teams to track what was caught, what was fixed, and what’s been validated.
- Knowledge base articles: Teams can turn knowledge base content into test cases that verify the agent’s answers are grounded in approved sources. Single-article tests evaluate whether the agent’s response is supported by the right article, and multi-article tests check whether the agent is correctly synthesizing information across two or more articles for more complex queries.
- Top customer questions from real conversations: Cresta extracts the most frequent questions from real customer conversations and turns them into test cases. This allows coverage to extend beyond what is documented in the knowledge base, and capture the actual language customers use to ask questions.
Test cases can be paired with Cresta Synthetic Customers, which generate data-derived personas from real customer conversation data. Rather than simulating how customers might behave, they reflect how your customers actually behave, making simulation testing more representative and the results more meaningful. What gets tested and how it gets graded both reflect reality. Coverage grows as the agent evolves, reviewer feedback accumulates, and more conversations occur over time.
In one e-commerce deployment, variation tests revealed that the agent handled standard return requests correctly but broke on gift-purchase framing: "I want to return this, I got it as a gift." The phrasing triggered a logic path the happy path hadn't exercised. It was caught offline, before any customer saw it.
New Out-of-the-Box LLM Evaluators: Grade the Response to the Stakes, Not Just the Answer
Even teams with well-calibrated evaluators run into a specific failure mode: the grading criteria don't match what the use case actually requires. A strict evaluator applied to an open-ended FAQ answer flags harmless wording variations as failures, while a lenient one applied to a compliance script lets material omissions through. The evaluations produce signal, but not necessarily the right one.
AI Agent Testing 2.0 also extends Cresta’s library of expert-aligned, out-of-the-box LLM evaluators with three more precise options for grading agent responses. Where the existing Golden Response evaluator checks for semantic equivalence, the new evaluators let teams choose the right level of strictness for the particular use case, rather than applying a single standard across the board:
- Completeness checks that the agent provided all the defined claims in the approved response, with additional information allowed – a fit for FAQ and general support answers.
- Accuracy checks that the agent provided only the defined claims in the approved response – a fit for SOP-driven workflows where staying on-script matters, like troubleshooting scripts, compliance language, guided processes.
- Completeness & Accuracy checks that the agent provided all of the defined claims and nothing else. This is the strictest option, designed for deterministic outputs: templates, regulated wording, and test cases where exact behavior is required and any deviation, whether by addition or omission, is considered a failure.
Together, the three evaluators give teams a clearer understanding of what type of failure has actually occurred, and an objective approach to grading agent responses. These evaluators are calibrated by domain experts and validated across use cases and verticals, allowing them to deliver reliable judgments, you guessed it, out of the box.
The New Bar for Production-Ready AI
Thankfully, that major retailer’s snafu was an edge case. But edge cases are exactly what testing programs exist to catch. And with AI Agent Testing 2.0, now they do.
Rigorous AI agent testing isn't just risk mitigation; it's a competitive advantage. Organizations that can deploy with confidence, iterate quickly, and demonstrate compliance will move faster and more safely than those still triaging their testing programs.
Earning and keeping customer trust means more than passing launch review. It means requirements that hold up in production, evaluators you can trust over time, and test coverage that reflects how customers actually behave, not how the team imagined they would.
With AI Agent Testing 2.0, rigorous testing doesn’t stop at launch. It scales with your agent, so every conversation is one you can stand behind - confidence at launch, confidence at scale.






