In our last blog post on Ocean-1, Cresta’s foundation model for the contact center, we showed that the Ocean model matches GPT-4 on Knowledge Assist at much lower cost and latency. That model is now live with our customers, and we have expanded the same paradigm to two other use cases: summarization and chat suggestions.
The Ocean Paradigm
In a recent paper on levels of AGI, researchers at Google DeepMind defined progress in AI along two dimensions: performance and generality. For example, ChatGPT is categorized as an emerging general AI, while AlphaGo is considered a superhuman narrow AI.
As we built the intelligence layer for the contact center, it became clear that this domain needs an expert-level AI: one with both knowledge and expertise. Expertise is different from knowledge; it is situational and comes from long hours of practice. This is where learning by example comes in, either through in-context learning or fine-tuning.
That’s why we built this learning paradigm for our Ocean foundation models (a code sketch of the staged pipeline follows the list):
- Public Base Model: Start with an open-source base model (e.g. Mistral) or a partner base model (e.g. OpenAI/Anthropic)
- Ocean Base Model: Instruction fine-tuning using synthetic data and partner customer data
- Ocean Per Industry: Instruction fine-tuning per industry
- Ocean Per Customer: For enterprise customers, train on instruction data generated by our automated pipeline
- Task Fine-tunes: Continue fine-tuning on task-specific data so that the model learns the situational expertise needed to perform at its peak
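To make the stages concrete, here is a minimal sketch of how such a pipeline could be wired together. The `fine_tune` helper and the stage dataset names are hypothetical placeholders, not Cresta’s actual training code:

```python
# Hypothetical sketch of a staged fine-tuning pipeline (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

def fine_tune(model, tokenizer, dataset_name: str):
    """Placeholder for one instruction fine-tuning stage on `dataset_name`.

    In practice this would run a Trainer (or a PEFT/LoRA loop) over the
    instruction pairs in the dataset and return the updated model.
    """
    print(f"fine-tuning on {dataset_name}")
    return model

base = "mistralai/Mistral-7B-v0.1"  # public base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Each stage continues from the previous stage's checkpoint.
for stage in [
    "synthetic_and_partner_instructions",  # -> Ocean Base Model
    "industry_instructions",               # -> Ocean Per Industry
    "customer_pipeline_instructions",      # -> Ocean Per Customer
    "task_specific_data",                  # -> Task Fine-tunes
]:
    model = fine_tune(model, tokenizer, stage)
```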
Knowledge Assist
In our previous blog post, we released Ocean for Knowledge Assist and demonstrated that a fine-tuned Mistral-7B model can beat GPT-3.5, and that its MoE version can beat GPT-4. We went live with the fine-tuned Mistral model. Both models are significantly cheaper to run inference with than the GPT models.
After going live with the Ocean model for this use case with a customer, we evaluated the model’s retrieval-augmented generation (RAG) performance using the following standard metrics:
| Metric | Description | Scoring |
| --- | --- | --- |
| Coverage | Does the response cover all key points of the reference answer? | 0 – covers no points<br>1 – covers some points<br>2 – covers all points |
| Contradiction | Does the response contradict any points of the reference answer? | 1 – no contradiction<br>0 – contradiction |
| Groundedness | Did the response hallucinate, i.e. generate facts not present in the retrieved articles? | 1 – no hallucination<br>0 – hallucination |
| Answer Relevancy | How relevant is the generated answer to the question? If the model produces too many irrelevant facts, it should score lower. | 1 – relevant<br>0 – not relevant |
| Citation Validation | Did the model include citations to articles? Are these citations accurate? | |
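To make the rubric concrete, it can be encoded as a small scoring record. The heuristic below is a naive stand-in for illustration only; in practice, grading like this is typically done by human reviewers or an LLM-as-judge, and every name here is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RagScores:
    coverage: int       # 0 = covers no key points, 1 = some, 2 = all
    contradiction: int  # 1 = no contradiction, 0 = contradicts reference
    groundedness: int   # 1 = no hallucination, 0 = ungrounded facts
    relevancy: int      # 1 = relevant to the question, 0 = not relevant

def score_coverage(response: str, key_points: list[str]) -> int:
    """Naive substring-based coverage check (real graders are more robust)."""
    hits = sum(point.lower() in response.lower() for point in key_points)
    if hits == len(key_points):
        return 2
    return 1 if hits else 0

# Example: both key points are mentioned, so coverage scores 2.
print(score_coverage(
    "Refunds take 5-7 business days and require the original receipt.",
    ["5-7 business days", "original receipt"],
))
```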
Summarization
In our previous blog post on Auto Summarization, we showed how custom models can save hours of after-call work. A smaller language model can deliver a post-call summary in much less time, improving agent receptivity. Now, with a 7B base model, we can scale summarization performance even further. Our new summarization model is tailored to the contact center domain, and most of our deployments require the model to adapt to various customer templates:
| Style | Template |
| --- | --- |
| Style 1 (default 3Rs) | Call reason: xxx<br>Customer’s Name: xxx<br>Resolution steps:<br>– xxx<br>– xxx<br>– xxx<br>Conversation results:<br>– xxx<br>– xxx |
| Style 2 (customized topic) | Customized Topic 1: xxx<br>Customized Topic 2: xxx<br>Customized Topic 3: xxx<br>Customized Topic 4: xxx<br>Customized Topic 5: xxx<br>… |
| Style 3 (customized topic + tab view) | Customized Topic 1: xxx<br>Customized Topic 2: xxx<br>Customized Topic 3: xxx<br>[Tab 1] [Tab 2]<br>Customized Topic 4: xxx |
| Style 4 (customized topic + dynamic generation) | # when refund is not mentioned<br>Customized Topic 1: xxx<br>Customized Topic 2: xxx<br>Customized Topic 3: xxx<br># when refund is mentioned |
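One way to serve these per-customer formats is to render each customer’s template spec into the summarization instruction at request time. The sketch below is our simplified illustration of that idea, with hypothetical names and prompt wording, not Cresta’s production prompt:

```python
# Hypothetical sketch: render a customer's template into a summarization prompt.
def build_summary_prompt(transcript: str, topics: list[str]) -> str:
    fields = "\n".join(f"{topic}: ..." for topic in topics)
    return (
        "Summarize the following contact-center conversation.\n"
        f"Fill in exactly this template:\n{fields}\n\n"
        f"Conversation:\n{transcript}"
    )

print(build_summary_prompt(
    "Agent: Hi, how can I help? ...",
    ["Call reason", "Customer's Name", "Resolution steps"],
))
```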
Here are a few observations about our new summarization model.
1. Our summary prioritizes customer privacy and meets high standards of sensitive data compliance.
| Summary topic | ChatGPT | Cresta Summary |
| --- | --- | --- |
| Reason for delinquency | Customer is going through cancer treatment | Medical |
2. Our summary is more concise, allowing agents to digest the entire summary quickly
| Summary topic | ChatGPT | Cresta Summary |
| --- | --- | --- |
| Customer’s complaints | Customer was mistakenly charged for three nights instead of two and was not provided with information regarding the hotel’s policy on group reservations. Additionally, they were not told about the charges and policies prior to checking in. | |
3. Our model can continuously improve based on edits/feedback made by agents
| Summary topic | Possible values from initial summary | Possible values from summary after feedback loop |
| --- | --- | --- |
| Administrative fee | Administrative fee: $xx<br>Administrative fee: N/A<br>(If the specific dollar amount is not mentioned, it will output N/A.) | Administrative fee: $xx<br>Administrative fee: 5% of the loan amount<br>Administrative fee: N/A<br>(It learns to use the percentage of the loan amount when the specific dollar amount is not mentioned.) |
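A plausible shape for this feedback loop is to turn agent edits into new supervision pairs for the next fine-tuning round. The sketch below is our illustration of that idea; the types and field names are hypothetical, not Cresta’s pipeline:

```python
# Hypothetical sketch: convert agent edits into fine-tuning pairs.
from dataclasses import dataclass

@dataclass
class SummaryEdit:
    transcript: str     # the conversation that was summarized
    model_summary: str  # what the model originally produced
    agent_summary: str  # the summary after the agent's edits

def to_training_pairs(edits: list[SummaryEdit]) -> list[dict]:
    """Keep only genuinely edited summaries as (prompt, target) pairs."""
    return [
        {"prompt": e.transcript, "target": e.agent_summary}
        for e in edits
        if e.agent_summary != e.model_summary
    ]
```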
Auto Compose (Chat)
Auto Compose is a great use case for fine-tuning Ocean models: we can leverage the large volume of chat transcripts available to derive a robust conversational model.
| Model | BLEURT Score | % time is better |
| --- | --- | --- |
| Previous Production Model | 0.4340 | 10.88% |
| New Mistral-Based Fine-tune | 0.4925 | 5.79% |
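For reference, BLEURT scores like the ones above can be computed with the Hugging Face `evaluate` wrapper around Google’s learned metric. The example texts below are invented, and the BLEURT package has to be installed separately; this is a minimal sketch, not Cresta’s evaluation harness:

```python
# Minimal sketch of scoring suggestions with BLEURT via `evaluate`.
# Requires the BLEURT package (installed from google-research/bleurt).
import evaluate

bleurt = evaluate.load("bleurt")

# Each pair: the model's suggested reply vs. what the agent actually sent.
predictions = ["I can help you update your billing address."]
references = ["Sure, I can update the billing address on your account."]

scores = bleurt.compute(predictions=predictions, references=references)["scores"]
print(sum(scores) / len(scores))  # mean BLEURT over the eval set
```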
Going Forward
We believe that a treasure trove of expertise sits inside our customers’ data. As an intelligence layer for customer conversations, Cresta integrates into hundreds of systems to build a unified understanding of expertise from this private data – in a highly secure, responsible way (read more about our commitment to responsible AI here). We plan to bring this paradigm to more contact center workflows and supercharge coaching, assistance, and automation with a collection of expert models.