
How Ocean-1 enhancements beat GPT-4 in powering Knowledge Assist

Today, we share an update to Ocean-1: a small Mixtral-based Ocean model beats GPT-4 at retrieval-augmented generation (RAG) while being 100x more cost-effective.

Several months ago, we introduced Ocean-1, the world’s first foundation model for the contact center. Ocean achieves better out-of-the-box capabilities, instruction-following ability, and latency/cost-effectiveness through domain-specific finetuning of robust base LLMs. In other words, the model first becomes good at basic (English) language, then is trained to be an excellent sales or customer service agent.

Since Ocean’s introduction, the field has made significant progress on open-source base LLMs, such as Mistral 7B, Phi-2, and Yi-34B. In particular, the recently released Mixtral 8x7B, a mixture-of-experts (MoE) model, achieves better accuracy than GPT-3.5 on many tasks.

In this post, we’ll show how we leverage these advancements alongside high-quality domain-specific data, and how finetuning the Mixtral model resulted in a better-than-GPT-4 model for retrieval-augmented generation.

Retrieval-Augmented Generation

Cresta uses Retrieval-Augmented Generation (RAG) to power our Knowledge Assist feature.

In Knowledge Assist, the real-time AI listens to the human-to-human conversation, detects moments to surface knowledge, automatically searches the knowledge base, and then uses an LLM to generate the final response.

Under the hood, Cresta uses RAG to push the right information at the right time, providing context for the LLM. One could use a general-purpose LLM such as GPT-3.5 or GPT-4 for each stage of the inference process, including query extraction, article summarization, and response generation (sketched below). However, this approach runs into a few challenges.
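Here is a minimal sketch of such a naive, all-GPT-4 pipeline; the helper names, prompts, and retriever interface are our own illustrations, not Cresta’s implementation:

```python
# A minimal sketch of a naive, all-GPT-4 RAG pipeline.
# Helper names, prompts, and the retriever interface are illustrative
# assumptions, not Cresta's implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm(prompt: str, model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def run_rag(conversation: str, knowledge_base) -> str:
    # Stage 1: query extraction from the live conversation
    query = llm(f"Extract the customer's current question from this chat:\n{conversation}")
    # Stage 2: retrieval + article summarization
    articles = knowledge_base.search(query)  # hypothetical retriever
    context = llm(f"Summarize these articles as context for '{query}':\n{articles}")
    # Stage 3: response generation
    return llm(
        f"Conversation:\n{conversation}\n\nContext:\n{context}\n\n"
        "Suggest the agent's next reply:"
    )
```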

[Figure 1]

First and foremost: cost. At the time of writing, GPT-4 costs $0.03 per 1K prompt tokens. Every time we run RAG end-to-end, we send the conversation, the question, and the retrieved articles to the LLM, which easily adds up to ~1K tokens. Each complete RAG invocation therefore costs ~3 cents. Assuming ten invocations per conversation, a customer with 1 million conversations a year could easily spend $300,000 on knowledge assistance alone.
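A quick back-of-the-envelope check of those numbers:

```python
# Back-of-the-envelope version of the cost math above.
price_per_prompt_token = 0.03 / 1000   # $0.03 per 1K prompt tokens (GPT-4, at time of writing)
tokens_per_invocation = 1000           # conversation + question + retrieved articles
invocations_per_conversation = 10
conversations_per_year = 1_000_000

annual_cost = (price_per_prompt_token * tokens_per_invocation
               * invocations_per_conversation * conversations_per_year)
print(f"${annual_cost:,.0f}")          # -> $300,000
```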

Next, low latency is critical for a real-time voice application like this. We found that lower latency correlates strongly with adoption, as users are less likely to ignore knowledge suggestions that arrive in time.

Realizing that LLMs are not one-size-fits-all, we set out to build custom Ocean models for RAG.

[Figure 2]

Ocean for Knowledge Assist

Our training data consists of synthetic and customer-specific data. Each example has the following fields (a hypothetical instance is sketched after the list):

  • partial chat (optional): the moment in the conversation where the question is triggered.
  • question: the query extracted for the knowledge base.
  • retrieved articles: article snippets returned from our retrieval system.
  • answer: the gold-labeled response.
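For illustration, a single example might look like the following (all field values are invented):

```python
# Hypothetical training example in the format above; values are invented.
example = {
    "partial_chat": "Customer: Hi, I was charged twice this month.\n"
                    "Agent: Sorry about that! Let me take a look.",
    "question": "How do I refund a duplicate charge?",
    "retrieved_articles": [
        "Billing FAQ: Duplicate charges are refunded to the original "
        "payment method within 5-7 business days...",
    ],
    "answer": "I can confirm the duplicate charge. I've issued a refund; "
              "you should see it back on your card within 5-7 business days.",
}
```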

For synthetic data, we construct a knowledge base from a web scrape, then ask GPT-4 to write questions and generate the answers. This creates a large dataset of FAQs across a diverse set of domains, which enhances the base LLM’s ability to reason over retrieved documents in the proper context and to generate relevant, high-quality responses.

[Figure 3]
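A simplified sketch of that loop, reusing the llm() helper from earlier (the prompts and the scraped article source are placeholder assumptions):

```python
# Sketch of the synthetic-data pipeline: for each scraped article,
# ask GPT-4 to invent a question and a grounded gold answer.
# Prompts are illustrative, not Cresta's actual templates.
def generate_synthetic_examples(scraped_articles: list[str]) -> list[dict]:
    examples = []
    for article in scraped_articles:
        question = llm(f"Write a realistic customer question that this article answers:\n{article}")
        answer = llm(
            f"Answer the question using only the article.\n"
            f"Question: {question}\nArticle: {article}"
        )
        examples.append({
            "question": question,
            "retrieved_articles": [article],
            "answer": answer,
        })
    return examples
```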

For customer-specific data, we run a query extraction model over a large corpus of conversations. Each extracted query retrieves articles, and GPT-4 generates a ground-truth answer. This step adapts the model to the customer’s domain and lets it continuously learn from the evolving knowledge base.
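The customer-specific step might look like the following sketch; the query extractor and retriever are stand-ins for the production components:

```python
# Sketch: mine real conversations for knowledge moments, retrieve articles,
# and have GPT-4 write the ground-truth answer. Interfaces are assumptions.
def label_customer_data(conversations: list[str], extract_query, knowledge_base) -> list[dict]:
    examples = []
    for chat in conversations:
        query = extract_query(chat)            # existing query extraction model
        if not query:                          # no knowledge moment in this chat
            continue
        articles = knowledge_base.search(query)
        answer = llm(
            f"Conversation:\n{chat}\nQuestion: {query}\nArticles:\n{articles}\n"
            "Write the ideal agent reply:",
            model="gpt-4",
        )
        examples.append({
            "partial_chat": chat,
            "question": query,
            "retrieved_articles": articles,
            "answer": answer,
        })
    return examples
```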

Training and Evaluation

The training dataset is converted into ChatML format, and we train both Mistral- and Mixtral-based models using LoRA via Axolotl. Each trained model is evaluated by GPT-4 against human-written responses: Correct = 5 points, Needs Improvement = 2 points, Wrong = 0 points.
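As a rough sketch, rendering one example into ChatML might look like this; the system prompt and field layout are our assumptions:

```python
# Sketch: render a training example as ChatML text for supervised finetuning.
# Axolotl can consume chat-formatted data like this; the exact template
# below is an illustrative assumption.
def to_chatml(example: dict) -> str:
    articles = "\n".join(example["retrieved_articles"])
    turns = [
        ("system", "You are a contact-center agent. Answer using the provided articles."),
        ("user", f"Chat so far:\n{example.get('partial_chat', '')}\n"
                 f"Question: {example['question']}\nArticles:\n{articles}"),
        ("assistant", example["answer"]),
    ]
    return "".join(f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in turns)
```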

[Figure 4]

As this initial benchmark shows, the Mixtral 8x7B base model significantly improves on the accuracy of the original Mistral 7B when trained on the same data, and it surpasses GPT-4 in output quality.

Feedback Loop

The LoRA-based architecture allows us to serve one adapter per customer. Not only does this save GPU memory, it also lets us continuously improve each model from customer feedback data:

[Figure 5]

Our thumbs-up and thumbs-down buttons provide a signal critiquing the model output; in turn, we collect more training data as the product gains usage.
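A minimal sketch of how that feedback could be captured (the storage layer and schema are assumptions):

```python
# Sketch: log thumbs-up/down on each suggestion as future training signal.
def record_feedback(example: dict, suggestion: str, thumbs_up: bool, store: list) -> None:
    store.append({
        **example,
        "model_output": suggestion,
        "label": "accept" if thumbs_up else "reject",
    })

# Accepted outputs can be folded into the SFT set; accept/reject pairs on the
# same prompt can later feed preference optimization (e.g., DPO).
```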

Serving at Scale

We have partnered with Fireworks AI to serve Mixtral/Mistral-based Ocean models. A single base-model cluster is set up for Cresta, while different LoRA adapters on top serve different customers. Fireworks can scale to thousands of LoRA adapters, so we don’t need completely separate models for each customer and use case. This allows us to achieve a 100x cost reduction compared to GPT-4 (the benchmark result is per serving unit and scales as we add capacity).
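As a sketch of per-customer routing, assuming an OpenAI-compatible Fireworks endpoint where the adapter is selected by model name (the model IDs below are hypothetical):

```python
# Sketch: one shared Mixtral base deployment, one LoRA adapter per customer.
# The endpoint shape is Fireworks' OpenAI-compatible API; the adapter
# naming scheme here is a hypothetical illustration.
from openai import OpenAI

fireworks = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="YOUR_FIREWORKS_API_KEY",
)

def suggest(customer_id: str, messages: list[dict]) -> str:
    resp = fireworks.chat.completions.create(
        model=f"accounts/cresta/models/ocean-rag-{customer_id}",  # hypothetical adapter ID
        messages=messages,
    )
    return resp.choices[0].message.content
```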

[Figure 6]

Conclusion

The results show that with a domain-adaptive LLM, it’s possible to achieve the best accuracy in response generation while also being 100x cheaper. This echoes our thesis that for enterprise use cases there is no “one model fits all”: LLM developers must weigh cost, latency, and accuracy trade-offs.

Breakthroughs in small open-source base models such as Mistral and Mixtral have unlocked this paradigm. We plan to bring it across our products, including chat suggestions and auto-summarization. Combined with preference-optimization techniques such as DPO, these models can continuously improve from usage data, delivering sticky ROI for our customers.

To learn more about how Cresta leverages the latest innovations to improve output and cut costs, schedule a personalized demo today! Or if you would like to work at the intersection of cutting-edge finetuned LLMs and real-world impact, we are hiring! 🧑‍💻👩‍💻🌊

Author:

Cresta LLM Team

December 20, 2023

