Today, we share an update to Ocean-1: A small 7B Ocean model beats GPT-4 in retrieval-augmented generation (RAG) while being 100x more cost-effective.
Several months ago, we introduced Ocean-1, the world’s first foundation model for the contact center. Ocean achieves better out-of-the-box capabilities, instruction-following, and latency/cost-effectiveness through domain-specific finetuning of robust base LLMs. In other words, the model first becomes good at basic (English) language, then trains to be an excellent sales or customer service agent.
Since Ocean’s introduction, the field has made significant progress on open-source base LLMs, such as Mistral 7B, Phi-2, and Yi-34B. In particular, the recently released Mixtral 8x7B, a mixture-of-experts (MoE) model, achieves better accuracy than GPT-3.5 on many tasks.
In this post, we’ll show how to leverage these advancements along with high-quality domain-specific data and explore how our efforts to finetune the Mixtral model have resulted in a better-than-GPT-4 model for retrieval-augmented generation.
Retrieval-Augmented Generation
Cresta uses Retrieval-Augmented Generation (RAG) to power our knowledge assist feature.
In knowledge assist, the real-time AI listens to the human-to-human conversation, detects moments to surface knowledge, automatically searches the knowledge base, and then uses an LLM to generate the final response.
Under the hood, Cresta uses RAG to push the right information at the right time and provide context for the LLM. One could use an LLM such as GPT-3.5 or GPT-4 for each stage of the inference process, including query extraction, article summarization, and response generation. However, this approach runs into a few challenges.
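To make those stages concrete, here is a minimal sketch of such a pipeline, assuming an OpenAI-style chat API; the function names, prompts, and the `retriever` callable are illustrative stand-ins rather than Cresta’s actual implementation.

```python
# Minimal RAG pipeline sketch. The prompts, function names, and the use of the
# OpenAI chat API are illustrative assumptions, not Cresta's implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm(prompt: str, model: str = "gpt-4") -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def run_rag(conversation: str, retriever) -> str:
    # Stage 1: query extraction from the live conversation.
    query = llm(f"Extract the customer's question from this conversation:\n{conversation}")
    # Stage 2: knowledge-base retrieval (retriever is any search backend
    # that returns a list of article snippets for a query).
    articles = retriever(query)
    # Stage 3: grounded response generation for the agent.
    context = "\n\n".join(articles)
    return llm(
        f"Conversation:\n{conversation}\n\nQuestion: {query}\n\n"
        f"Articles:\n{context}\n\nSuggest a grounded reply for the agent."
    )
```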
First and foremost: cost. At the time of writing, GPT-4 costs $0.03 per 1k prompt tokens. Every time we run the RAG pipeline end-to-end, we send the conversation, the question, and the retrieved articles to the LLM, which can easily add up to ~1k tokens. Each complete RAG invocation therefore costs ~3 cents. Assuming ten invocations per conversation, a customer with 1 million conversations a year could easily spend $300,000 on knowledge assistance alone.
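Spelled out as code, the back-of-the-envelope math from the numbers above looks like this:

```python
# Back-of-the-envelope GPT-4 cost estimate using the figures quoted above.
price_per_1k_prompt_tokens = 0.03   # USD, GPT-4 prompt pricing at time of writing
tokens_per_invocation = 1_000       # conversation + question + retrieved articles
invocations_per_conversation = 10
conversations_per_year = 1_000_000

cost_per_invocation = (tokens_per_invocation / 1_000) * price_per_1k_prompt_tokens
annual_cost = cost_per_invocation * invocations_per_conversation * conversations_per_year
print(f"${annual_cost:,.0f} per year")  # -> $300,000 per year
```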
Next, low latency is critical for real-time voice applications like this. We found lower latency correlates strongly with adoption, as users are less likely to ignore the knowledge suggestions.
Realizing that LLMs are not one-size-fits-all, we set out to build custom Ocean models for RAG.
Ocean for Knowledge Assist
Our training data consists of synthetic and customer-specific data. Each example follows this format (a sketch of one example appears after the list):
- partial chat (optional): moment when the question is triggered.
- question: extracted query for the knowledge base.
- retrieved articles: article snippets returned from our retrieval system.
- answer: the gold-labeled response.
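For illustration, a single training example might look like the following; the field names and contents are entirely hypothetical.

```python
# Hypothetical training example in the format described above; the contents
# are illustrative, not actual customer data.
example = {
    "partial_chat": [            # optional: the moment the question is triggered
        {"role": "customer", "text": "Hi, I was charged twice this month."},
        {"role": "agent", "text": "Sorry about that, let me check."},
    ],
    "question": "How do I refund a duplicate charge?",   # extracted query
    "retrieved_articles": [                              # snippets from retrieval
        "Refund policy: duplicate charges are refunded within 5 business days...",
        "To issue a refund, open Billing > Transactions and select Refund...",
    ],
    "answer": "Let the customer know the duplicate charge will be refunded "
              "within 5 business days, then issue the refund from Billing > "
              "Transactions.",                           # gold-labeled response
}
```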
For synthetic data, we construct a knowledge base from a web scrape, then ask GPT-4 to write questions and generate answers. This creates a large FAQ dataset spanning diverse domains and enhances the base LLM’s ability to reason over retrieved documents in the proper context and summarize relevant, high-quality responses.
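A minimal sketch of this synthetic-generation step, assuming an OpenAI-style client and a JSON-returning prompt (both are illustrative choices, not a confirmed recipe):

```python
# Sketch of synthetic FAQ generation from scraped articles. The prompt,
# helper names, and output schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def synthesize_examples(article: str, n_questions: int = 3) -> list[dict]:
    """Ask GPT-4 to invent questions answerable from a scraped article,
    then answer each one using only that article as context."""
    prompt = (
        f"Write {n_questions} FAQ-style questions that can be answered from the "
        "article below, each with an answer grounded only in that article. "
        'Return a JSON list of {"question": ..., "answer": ...} objects.\n\n'
        + article
    )
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    pairs = json.loads(resp.choices[0].message.content)
    return [
        {"question": p["question"], "retrieved_articles": [article], "answer": p["answer"]}
        for p in pairs
    ]
```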
For customer-specific data, we run a query extraction model over a large corpus of conversations. Each extracted query retrieves articles, and GPT-4 generates a ground-truth answer. This step finetunes the LLM to the customer’s domain and lets it continuously learn from the evolving knowledge base.
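A rough sketch of this pipeline follows; `query_extractor`, `retriever`, and `gpt4_answer` are hypothetical stand-ins for the actual models and services.

```python
# Sketch of customer-specific dataset construction over a conversation corpus.
# The callables and their interfaces are assumptions.
def build_customer_dataset(conversations, query_extractor, retriever, gpt4_answer):
    dataset = []
    for chat in conversations:
        query = query_extractor(chat)       # detect the moment and extract the query
        if query is None:
            continue                        # no knowledge-worthy moment in this chat
        articles = retriever(query)         # hit the customer's knowledge base
        answer = gpt4_answer(chat, query, articles)  # GPT-4 writes the ground truth
        dataset.append({
            "partial_chat": chat,
            "question": query,
            "retrieved_articles": articles,
            "answer": answer,
        })
    return dataset
```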
Training and Evaluation
The training dataset is converted into ChatML format. Then, we train both Mistral- and Mixtral-based models using LoRA and Axolotl. The trained model is evaluated by GPT-4 against human-written responses: Correct = 5 points, Need Improvement = 2 points, Wrong = 0 points.
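As a rough illustration of the judging scheme, here is a sketch that maps GPT-4 verdicts to the point values above; the prompt and verdict parsing are assumptions, and only the scores come from our setup.

```python
# GPT-4-as-judge scoring sketch. Only the point values are from the text;
# the prompt and parsing are illustrative.
SCORES = {"Correct": 5, "Need Improvement": 2, "Wrong": 0}

def judge(question, articles, human_answer, model_answer, llm) -> int:
    # `llm` is any prompt -> text callable (e.g. the GPT-4 helper sketched earlier).
    verdict = llm(
        "Compare the model answer against the human-written answer for this "
        "question and the retrieved articles. Reply with exactly one of: "
        "Correct, Need Improvement, Wrong.\n\n"
        f"Question: {question}\nArticles: {articles}\n"
        f"Human answer: {human_answer}\nModel answer: {model_answer}"
    )
    return SCORES.get(verdict.strip(), 0)
```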
As we can see from this initial benchmark, the Mixtral 8x7B base model significantly improves on the original Mistral 7B’s accuracy when trained on the same data, and it surpasses GPT-4 in output quality.
Feedback Loop
The LoRA-based architecture allows us to serve one adapter per customer. Not only does this save GPU memory, it also lets us continuously improve each model from customer feedback:
Our thumbs-up and thumbs-down buttons provide a signal for critiquing the model output, so we collect more training data as the product gains more usage.
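A simplified sketch of how such feedback could be folded back into training data; the record schema and relabeling policy are hypothetical.

```python
# Sketch of turning thumbs-up/down feedback into new training signal.
# Only the up/down signal itself comes from the text above.
def collect_feedback_example(event: dict, store: list) -> None:
    record = {
        "customer_id": event["customer_id"],    # routes to that customer's LoRA adapter
        "question": event["question"],
        "retrieved_articles": event["articles"],
        "model_answer": event["suggestion"],
        "label": "good" if event["thumbs_up"] else "needs_review",
    }
    # Thumbs-down examples would be reviewed/corrected before entering training.
    store.append(record)
```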
Serving at Scale
We have partnered with Fireworks AI to serve Mixtral/Mistral-based Ocean models. A single base-model cluster is set up for Cresta, and different LoRA adapters on top of it serve different customers. Fireworks can scale to thousands of LoRA adapters, so we don’t need a completely separate model for each customer and use case. This allows us to achieve a 100x cost reduction compared to using GPT-4. (*The benchmark result is per serving unit and scales as we add more capacity.)
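For illustration, querying a per-customer adapter through an OpenAI-compatible endpoint such as the one Fireworks exposes might look like the sketch below; the account and model-naming scheme are placeholders, not our production configuration.

```python
# Sketch of calling a per-customer LoRA adapter via an OpenAI-compatible
# endpoint. The model name below is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="<FIREWORKS_API_KEY>",
)

def knowledge_assist(customer_id: str, messages: list[dict]) -> str:
    # One shared Mixtral base deployment; the adapter is selected per customer
    # via the model name (illustrative naming scheme).
    resp = client.chat.completions.create(
        model=f"accounts/cresta/models/ocean-rag-{customer_id}",
        messages=messages,
    )
    return resp.choices[0].message.content
```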
Conclusion
The results show that with a domain-adaptive LLM, it’s possible to achieve the best accuracy in response generation while also being 100x cheaper. This echoes our thesis that for enterprise use cases there is no “one model fits all”: LLM developers must consider cost, latency, and accuracy trade-offs.
The breakthroughs in small open-source base models, such as Mistral and Mixtral, have unlocked the potential for this paradigm. We plan to bring it across our products, such as chat suggestions and auto summarization. Combined with preference-optimization techniques such as DPO, these models can continuously improve from usage data, offering sticky ROI for our customers.
To learn more about how Cresta leverages the latest innovations to improve output and cut costs, schedule a personalized demo today! Or if you would like to work at the intersection of cutting-edge finetuned LLMs and real-world impact, we are hiring! 🧑💻👩💻🌊