Engineering
December 19, 2025

Evaluating AI Voices – What Does It Mean to Sound “Good”?

Henry Zhang
Product Manager, AI Agent
Till Poppels
NLP Specialist
Hagyeong Shin
NLP Specialist

"Your AI Agent doesn't sound good"—this critique is perhaps the deepest dread for anyone building a voice agent. The disappointment is often quickly followed by confusion: what, precisely, does it mean for an AI to sound "good"? In this article, we aim to resolve this question, drawing upon Cresta’s structured Text-to-Speech (TTS) evaluation process and our experience in collaborating with enterprises to meticulously curate the optimal voice for their brand identity.

The immediate, common association is that a high-quality AI voice should sound natural, or human. Yet, even among human voices, parity is elusive; some resonate far more profoundly than others. Consider the rich depth of Morgan Freeman in Se7en, the sophisticated grace of Audrey Hepburn in Breakfast at Tiffany’s, or the effortless self-assurance of Sean Connery as 007. For an AI voice to truly excel, three things matter: the speaker's inherent vocal characteristics, how the voice is controlled in conversation, and how faithfully the TTS recreates the cloned voice.

Factors largely inherent to the speaker

While the most gifted actors can fluidly mimic nearly any voice, every person possesses a typical, or natural, way of speaking. The brightness of a voice and its average pitch are generally foundational and intrinsic to the speaker. At Cresta, we frequently encounter requests from clients seeking a "warm voice with executive presence"—a preference often realized through a darker voice and a lower average pitch. The distinction between these two dimensions is a crucial nuance. Pitch is readily understood, relating simply to how high or low the note is. Brightness, conversely, is more evocative: a bright voice feels clear and luminous, like sunlight refracting on glass, while a dark voice sounds rich, comforting, and resonant, like a mellow, deep musical instrument.

These recordings bring the nuanced difference to life:
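For readers who want a rough quantitative handle on this distinction, the sketch below uses two common acoustic proxies: the spectral centroid (a frequently used, if imperfect, correlate of perceived brightness) and the mean fundamental frequency (average pitch). This is illustrative only and assumes librosa is available; it is not how Cresta formally characterizes voices, and the file name is hypothetical.

```python
# Illustrative only: rough acoustic proxies for "brightness" and average pitch.
# Spectral centroid loosely correlates with perceived brightness; mean F0 from
# pYIN approximates the average speaking pitch.
import librosa
import numpy as np

def describe_voice(wav_path: str) -> dict:
    y, sr = librosa.load(wav_path, sr=None)

    # Spectral centroid (Hz): higher values tend to sound "brighter".
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]

    # Fundamental frequency (Hz) over voiced frames: average speaking pitch.
    f0, _, _ = librosa.pyin(
        y, sr=sr,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz
        fmax=librosa.note_to_hz("C6"),  # ~1047 Hz
    )

    return {
        "mean_spectral_centroid_hz": float(centroid.mean()),
        "mean_f0_hz": float(np.nanmean(f0)),
    }

# describe_voice("speaker_sample.wav")  # hypothetical recording
```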

Factors controllable by the speaker and the TTS

The naturalness of a synthetic voice, and whether it is truly production-ready, often depends more on factors the speaker can control than on the voice's inherent qualities. Furthermore, the process of cloning a voice for TTS can introduce unintended artifacts or "hallucinations." To ensure high quality, every AI voice is rigorously assessed across four key dimensions before it is classified as production-ready:

1) Accuracy: This is the baseline. If an AI agent mangles emails, URLs, reservation numbers, or names, credibility evaporates. The correct pronunciation of heteronyms must also be determined from context (e.g., "lead the team" vs. "lead is a heavy metal"); failing to do so immediately signals to the caller that they are talking to a bot.

Here is an example of TTS mangling words:


Example evaluation criterion: Every word should be clear and correctly pronounced on the first listen.
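Part of this check can be automated by round-tripping the synthesized audio through ASR and comparing the transcript against the script, with special attention to critical entities such as emails, URLs, and reservation numbers. The sketch below assumes an ASR transcript is already available and only illustrates the comparison; it is not Cresta's evaluation code, and the example strings are invented.

```python
# A minimal sketch of one automatable slice of the accuracy check: compare an ASR
# transcript of the synthesized audio against the original script using word error
# rate (WER). The ASR step itself is assumed and not shown.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard Levenshtein distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

script = "Your reservation number is A12345"          # invented example
asr_transcript = "Your reservation number is A1 2 3 45"  # hypothetical ASR output
print(f"WER: {word_error_rate(script, asr_transcript):.2f}")
```

A text-level check like this cannot catch a mispronounced heteronym (ASR will usually transcribe "lead" correctly either way), which is one reason human listening remains part of the process.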

2) Stability: Demos hide it; production reveals it. Abrupt jumps in volume or pitch make the utterance from a single turn sound like different speakers stitched together. We require smooth, continuous contours—no sudden spikes, no “who turned the dial?” moments.

Here is an example of fluctuations in volume:


Example evaluation criterion: Volume varies minimally and smoothly, never abruptly.
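As an illustration of how this criterion could be spot-checked automatically, the sketch below computes a short-time loudness contour with librosa and flags abrupt frame-to-frame jumps. The threshold and file name are invented for the example; this is a heuristic sketch, not Cresta's production check.

```python
# Illustrative heuristic for the stability check: compute a short-time loudness
# contour and flag abrupt frame-to-frame jumps.
import librosa
import numpy as np

def abrupt_volume_jumps(wav_path: str, max_jump_db: float = 6.0) -> list[float]:
    y, sr = librosa.load(wav_path, sr=None)
    hop_length = 512
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    loudness_db = librosa.amplitude_to_db(rms, ref=np.max)

    # Frame-to-frame change in loudness; large deltas sound like someone
    # "turned the dial" mid-utterance.
    deltas = np.abs(np.diff(loudness_db))
    jump_frames = np.where(deltas > max_jump_db)[0]
    # Approximate timestamps (seconds) of the flagged jumps.
    return [float(f * hop_length / sr) for f in jump_frames]

# abrupt_volume_jumps("agent_turn.wav")  # hypothetical recording
```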

3) Naturalness: Robotic voices space words like metronomes. Human speech breathes. Pauses flex with meaning: their duration should vary between words, ideas, and sentences.

Here is an example of lacking pauses:


Example evaluation criterion: Pauses within and between sentences are sufficient and vary naturally.
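One rough way to quantify this is to treat the gaps between non-silent regions as pauses and check that their durations actually vary; a metronome-like voice shows near-zero variance. The sketch below is illustrative only, assumes librosa, and uses arbitrary thresholds and a hypothetical file name.

```python
# Illustrative sketch: measure pauses as gaps between non-silent regions and
# check how much their durations vary.
import librosa
import numpy as np

def pause_durations(wav_path: str, top_db: float = 35.0, min_pause_s: float = 0.05):
    y, sr = librosa.load(wav_path, sr=None)
    # Non-silent (start, end) intervals, in samples.
    intervals = librosa.effects.split(y, top_db=top_db)

    pauses = []
    for (_, prev_end), (next_start, _) in zip(intervals[:-1], intervals[1:]):
        gap = (next_start - prev_end) / sr
        if gap >= min_pause_s:
            pauses.append(gap)
    return pauses

# pauses = pause_durations("agent_turn.wav")  # hypothetical recording
# print(f"{len(pauses)} pauses, std of duration: {np.std(pauses):.3f}s")
```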

Beyond pauses, naturalness encompasses several other dimensions. We evaluate for a speech rate and a pitch contour that vary naturally to reflect both the emotional content and the sentence type. Furthermore, we ensure that any paralinguistic sounds, such as breaths or laughs, flow naturally with the sentence.
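Pitch monotony in particular lends itself to a simple proxy: the variability of the fundamental-frequency contour measured in semitones. The sketch below is a rough illustration under the same assumptions as the earlier examples (librosa available, thresholds left to the evaluator); it says nothing about whether the variation is appropriate, which still requires human or LAM judgment.

```python
# Illustrative sketch for the "monotone" failure mode: pitch variability of the
# contour in semitones. Very low variability tends to sound robotic.
import librosa
import numpy as np

def pitch_variability_semitones(wav_path: str) -> float:
    y, sr = librosa.load(wav_path, sr=None)
    f0, _, _ = librosa.pyin(
        y, sr=sr,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
    )
    voiced = f0[~np.isnan(f0)]
    # Express the contour in semitones relative to the median pitch.
    semitones = 12 * np.log2(voiced / np.median(voiced))
    return float(np.std(semitones))
```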

The naturalness of an AI voice is influenced by both the speaker's vocal control during recording and the training of the TTS model.

4) Professionalism: There are voice characteristics common in everyday conversations that are not considered professional when representing certain brands. For example, vocal fry can suggest hesitation or a lack of confidence, while chronic uptalk can imply uncertainty and make statements sound like questions.

Here is an example of uptalk:


Example evaluation criterion: Statements always end with falling or flat pitch (no uptalk).
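A crude automated proxy for this criterion is to fit a line to the pitch contour over the last half second of a declarative sentence and flag a clearly rising slope. The sketch below assumes librosa; the window length and slope threshold are invented example values, not Cresta's.

```python
# Illustrative heuristic for the uptalk check: estimate the pitch slope over the
# final part of an utterance and flag a clear rise at the end of a statement.
import librosa
import numpy as np

def ends_with_rising_pitch(wav_path: str, window_s: float = 0.5,
                           min_rise_semitones_per_s: float = 2.0) -> bool:
    y, sr = librosa.load(wav_path, sr=None)
    tail = y[-int(window_s * sr):]
    f0, _, _ = librosa.pyin(
        tail, sr=sr,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
    )
    times = librosa.times_like(f0, sr=sr)
    voiced = ~np.isnan(f0)
    if voiced.sum() < 5:
        return False  # too little voiced speech at the end to judge

    # Pitch in semitones relative to the first voiced frame, then a linear fit.
    semitones = 12 * np.log2(f0[voiced] / f0[voiced][0])
    slope = np.polyfit(times[voiced], semitones, 1)[0]  # semitones per second
    return slope > min_rise_semitones_per_s
```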

If the voice recording used for TTS cloning contains excessive vocal fry or uptalk, TTS models often pick them up. A more effective approach for achieving a professional-sounding voice is to coach the speaker to project their voice more, which helps reduce vocal fry, and to minimize upward inflection at the end of declarative sentences.

Evaluation process

To evaluate a vendor or a specific voice, we employ a rigorous four-step process:

  1. Preliminary Review: We begin with a rapid evaluation conducted by subject-matter experts, using a set of audio examples hand-picked to be highly representative of the range of emotions and business contexts crucial in contact center conversations. This initial screening allows us to swiftly identify and remove vendors that clearly fail to meet our baseline criteria.
  2. Human Labeling: For the remaining shortlisted vendors, we apply their TTS voices to an extended set of scripts, carefully curated to cover various utterance types, emotional tones, sentence lengths, commonly observed TTS errors, and contact center conversation contexts. Multiple human subject-matter experts then manually evaluate each audio clip against all quantifiable quality dimensions. These labels are used to train a Large Audio Model (LAM) as a judge, ensuring it is aligned with our subject-matter experts’ preferences across every dimension.
  3. LAM Evaluations: Manually evaluating hundreds of voices is not tractable, so we use the LAM judge to assess each voice across all specified dimensions and generate a comprehensive composite score. This score synthesizes performance across the four factors—accuracy, stability, naturalness, and professionalism (a minimal illustration of such an aggregation appears after this list). Crucially, the LAM judge does not evaluate isolated, one-off sentences. Instead, it assesses the TTS voice as it powers an AI Agent within a simulated, interactive conversation, ensuring that each voice is evaluated directly within our enterprise customers’ contact center context. Only vendors and voices that achieve a minimum qualifying score advance to the final stage.
  4. Blind Panel Testing: Although the finalists have passed objective evaluations, human perception remains the ultimate arbiter of quality. We convene a diverse panel of external testers—sometimes including executives from our own customers—to provide feedback unbiased by the internal development process. Each panelist is given several distinct phone numbers to call. While each number is powered by a different TTS vendor, all use an identical voice clone, ensuring a true apples-to-apples comparison. The Cresta team then finalizes the voice ranking based on the panelists' stack-rankings and their accompanying rationales.
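To make step 3 concrete, here is a minimal sketch of how per-dimension judge scores might roll up into a composite score with a qualifying threshold. The weights, scale, and cutoff are invented for illustration and are not Cresta's actual values.

```python
# Illustrative sketch of aggregating per-dimension judge scores into a composite
# score with a qualifying threshold. All numbers below are invented examples.
from dataclasses import dataclass

@dataclass
class VoiceScores:
    accuracy: float        # each score assumed to be on a 0-1 scale
    stability: float
    naturalness: float
    professionalism: float

# Hypothetical weights; accuracy weighted highest because errors break trust fastest.
WEIGHTS = {"accuracy": 0.4, "stability": 0.2, "naturalness": 0.2, "professionalism": 0.2}
QUALIFYING_SCORE = 0.85  # hypothetical cutoff for advancing to blind panel testing

def composite_score(s: VoiceScores) -> float:
    return (WEIGHTS["accuracy"] * s.accuracy
            + WEIGHTS["stability"] * s.stability
            + WEIGHTS["naturalness"] * s.naturalness
            + WEIGHTS["professionalism"] * s.professionalism)

def advances_to_panel(s: VoiceScores) -> bool:
    return composite_score(s) >= QUALIFYING_SCORE

print(advances_to_panel(VoiceScores(0.95, 0.9, 0.88, 0.92)))  # True
```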

This comprehensive evaluation framework allows us to continuously refine our TTS vendor partnership strategy and ensure the voices we recommend are optimally suited to our customers’ specific industry, needs, and brand identity.

Download our Enterprise Buyer's Guide to AI Agents to cut through the hype and choose an AI agent solution that drives lasting performance, reliability, and customer satisfaction.