How We Reduced Our Labeling Cost by 10x

Navjot Matharu

At Cresta, we are democratizing expertise for sales and support teams by making every agent an expert. To distill such expertise into software, we ask top agents to demonstrate and, in turn, help us label best practices. Machine Learning models are then trained for customers to maximize their KPIs. Our models continuously learn what top agents do differently and scale those behaviors across entire teams.

Apart from providing goal-directed suggestions during ongoing live chats, which we talked about in our recent Action Directed GPT-2 blog post, another unique feature that Cresta offers is real-time coaching assist. As shown below, Cresta provides personalized coaching at key moments in a live chat, to instill the behaviors every agent needs to perform like a top agent.

The Real-time Coaching and Agent Assist features mentioned above are powered by our Natural Language Understanding (NLU) pipeline, which produces the models that help us understand and track the state of a conversation as a chat progresses between an agent and a visitor. The two most common tasks our NLU pipeline solves are:

Intent Classification: detecting the intent behind each message from both agent and visitor
Chat Driver Classification: detecting and tracking the main objective behind the visitor reaching out

In 2019, as our customer base started to grow rapidly, one of the biggest challenges we faced was the time and effort needed to label the data required by our NLU Classification pipeline. To scale as a software company, we strive to maximize the speed at which we develop and iterate on the required models. In this blog post, we share how our classification pipeline evolved over time and how we reduced our labeling cost and effort by over 10x, while continuously pushing our accuracy benchmarks forward.

Deep Transfer Learning

As was the case for most NLP pipelines across the world in 2019, the first big jump in efficiency came with the introduction of Deep Transfer Learning. Transfer learning, in the form of pre-trained language models, has revolutionized the field of NLP, leading to state-of-the-art results on a wide range of tasks. The idea is to first pre-train a model on a large unlabeled dataset using a language modeling objective, and then fine-tune it on a smaller labeled dataset using a supervised task of choice.

Many practical applications of NLP occur in scenarios where there is a scarcity of labeled data. This is where fine-tuning large pre-trained language models has changed the game completely. These models have been shown to be [...]

One-vs-all Classification

Buoyed by the success of the multi-head architecture, we turned our attention to a problem which was proving to be a costly step in our labeling process: handling a

To address the above challenge of iterating on a growing label taxonomy, we converted the multi-class classification problem to a

[Figure: Multi-head BERT for One-vs-all Intent Classification]

The above architecture gave us the flexibility to add more classes as we iterated on the taxonomy required to produce the experience desired by our customers, without having to re-label our existing dataset each time. This architecture could be used in either a single-task or a multi-task setting by simply prepending the task name to the class name to create a unique identifier for each head.
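To make the head-naming idea concrete, here is a minimal sketch of independent binary heads keyed by a task-prefixed class name. It is not our production model: the `OneVsAllHeads` class, the `/` separator, the example class names, and the random stand-in for the shared encoder output are all assumptions made for illustration.

```python
import numpy as np


def head_id(task: str, class_name: str) -> str:
    """Unique head identifier: the class name prefixed with its task name.
    The "/" separator is an assumption, not something fixed by the post."""
    return f"{task}/{class_name}"


class OneVsAllHeads:
    """Independent binary (sigmoid) heads on top of a shared sentence encoding."""

    def __init__(self, hidden_size: int, seed: int = 0):
        self.hidden_size = hidden_size
        self.rng = np.random.default_rng(seed)
        self.heads = {}  # head id -> (weight vector, bias)

    def add_head(self, task: str, class_name: str) -> str:
        """Add a new binary head; existing heads are left untouched."""
        key = head_id(task, class_name)
        if key in self.heads:
            raise ValueError(f"head {key!r} already exists")
        weights = self.rng.normal(scale=0.02, size=self.hidden_size)
        self.heads[key] = (weights, 0.0)
        return key

    def predict_proba(self, encoding: np.ndarray) -> dict:
        """encoding: (batch, hidden) array -> {head id: (batch,) probabilities}."""
        return {
            key: 1.0 / (1.0 + np.exp(-(encoding @ w + b)))
            for key, (w, b) in self.heads.items()
        }


# Hypothetical usage: two tasks share one encoder output.
heads = OneVsAllHeads(hidden_size=8)
heads.add_head("intent", "greeting")
heads.add_head("chat_driver", "cancel_service")

encoding = np.random.default_rng(1).normal(size=(2, 8))  # stand-in for encoder output
before = heads.predict_proba(encoding)["intent/greeting"].copy()

# Growing the taxonomy adds one head and does not alter the existing ones.
heads.add_head("intent", "discount_request")
after = heads.predict_proba(encoding)
```

Because each head makes its own yes/no decision, the per-head probabilities are not forced to sum to one, which is what lets a class be added later without revisiting labels for the others.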

Binary Labeling Interface with Loss Masking

Data labeling interfaces and best practices have, in general, been an under-researched area, as was touched upon by François Chollet's recent tweet, which sparked a debate in the research community. Our experience scaling Machine Learning for business use cases pushed us to treat data curation and labeling like any other research problem we were looking to solve.

Labeling cost has two dimensions: the number of labeled samples required and the average time required to "correctly" label a sample. We realized that reaching a high-quality labeled dataset was often a costly step requiring multiple quality assurance iterations. With a much more flexible one-vs-all architecture, instead of only looking for ways to reduce the number of labeled samples required by our models, we started optimizing our labeling interface with the goal of reducing the difficulty of labeling a given sample.

Humans usually have a short attention span, and labeling can often be a tedious and mundane task. We A/B tested a new labeling interface where labelers would make a single binary decision at a time: True/False for a (sample, class) pair, determining whether the sample belongs to that class or not.

A labeler could pick a class they wanted to focus on, and the interface would present a sample to be labeled in a binary fashion, accompanied by clear labeling guidelines and examples, as shown in the image above. This interface allowed labelers to think about one class at a time, lowering their cognitive load, while also allowing us to scale and distribute the labeling tasks more efficiently among the labelers. Our results showed that this interface resulted in ~2x faster labeling, with fewer mistakes made by the labelers.

Integrating the Binary Labeling Interface with our one-vs-all architecture meant we had to solve one problem: there was no guarantee that, for a given sample, all the classes would be labeled. More explicitly, given the large amount of unlabeled data we usually work with, the design choice of labeling one class per sample meant that, for a given labeled sample in our training set, we would most likely not have a supervision signal for all the heads. To address this, we implemented Loss Masking: for a given sample, we masked the loss for all the heads we didn't have a label for. As demonstrated in the image below, for each sample, the loss is only applied to heads for which we have a label in the training batch.
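As a rough illustration of Loss Masking, the sketch below turns binary (sample, head, True/False) decisions into a target matrix plus a mask, and computes a binary cross-entropy that ignores unlabeled heads. The head names, the function names, and the choice to average over the number of labeled entries are assumptions for the example, not a description of our training code.

```python
import numpy as np

# Hypothetical head identifiers, in the task-prefixed style described earlier.
HEADS = ["intent/greeting", "intent/discount_request", "intent/goodbye"]


def build_targets_and_mask(binary_labels, num_samples, heads=HEADS):
    """binary_labels: iterable of (sample_index, head_id, bool) decisions.

    Returns (targets, mask), both shaped (num_samples, num_heads).
    mask is 1.0 where a label exists and 0.0 where the head is unlabeled.
    """
    index = {head: j for j, head in enumerate(heads)}
    targets = np.zeros((num_samples, len(heads)))
    mask = np.zeros((num_samples, len(heads)))
    for i, head, value in binary_labels:
        targets[i, index[head]] = float(value)
        mask[i, index[head]] = 1.0
    return targets, mask


def masked_bce(probs, targets, mask, eps=1e-7):
    """Binary cross-entropy averaged over labeled (sample, head) entries only."""
    probs = np.clip(probs, eps, 1.0 - eps)
    per_entry = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    n_labeled = mask.sum()
    if n_labeled == 0:
        return 0.0  # no supervision signal in this batch
    return float((per_entry * mask).sum() / n_labeled)


# Each sample received a single binary decision, so most heads are unlabeled.
labels = [(0, "intent/greeting", True), (1, "intent/goodbye", False)]
targets, mask = build_targets_and_mask(labels, num_samples=2)

probs = np.array([[0.9, 0.5, 0.5],
                  [0.5, 0.5, 0.2]])  # stand-in for per-head sigmoid outputs
loss = masked_bce(probs, targets, mask)
```

In a deep learning framework the same effect is usually obtained by computing the element-wise loss without reduction and multiplying it by the mask before averaging.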

Active Learning

Next, we turned our attention to pushing the boundary of how sample-efficient Deep Transfer Learning could be, by introducing Active Learning into the pipeline. Our goal was to explore what can be achieved, in terms of both accuracy and the associated labeling cost, when these large pre-trained language models are used in conjunction with Active Learning techniques.

Similar to how humans learn, giving a model the power to interactively query a human to obtain labels at certain data points, i.e. introducing human guidance at various intervals, can dramatically improve the learning process. This is the key idea behind Active Learning.

[Figure: As described in the above plot using a toy dataset, choosing the optimal data points to label can dramatically reduce the amount of labeled data the model might need]

Active Learning is an iterative process, which can be described by the following steps:

Step 1: Label a small set of data, instead of investing huge labeling resources and cost upfront
Step 2: Train a model on the above and then use it to predict outputs on unlabeled data
Step 3: From the predictions, select data points based on a sampling strategy (for example Uncertainty Sampling – which selects data points the model is most uncertain about) and label those to include in the training dataset
Step 4: (Back to Step 2) Retrain the model with the updated dataset and repeat the rest of the steps until a satisfactory quality is achieved
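The four steps above can be sketched as a small loop. This is an illustrative outline only: `train`, `predict`, and `ask_labeler` are hypothetical callables standing in for the model and the human labeler, a fixed number of rounds replaces the "satisfactory quality" stopping check, and the toy threshold classifier exists just to make the example runnable.

```python
import numpy as np


def uncertainty_sample(probs, k):
    """Indices of the k predictions closest to 0.5 (most uncertain first)."""
    distance = np.abs(np.asarray(probs, dtype=float) - 0.5)
    return np.argsort(distance, kind="stable")[:k].tolist()


def active_learning_loop(pool, seed_labels, train, predict, ask_labeler,
                         batch_size, rounds):
    """pool: list of samples; seed_labels: {pool index: label} (Step 1)."""
    labeled = dict(seed_labels)
    model = train(labeled)                                   # Step 2: train
    for _ in range(rounds):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        probs = predict(model, [pool[i] for i in unlabeled])  # Step 2: predict
        for j in uncertainty_sample(probs, batch_size):       # Step 3: sample
            idx = unlabeled[j]
            labeled[idx] = ask_labeler(pool[idx])             # Step 3: label
        model = train(labeled)                                # Step 4: retrain
    return model, labeled


# Toy setup (an assumption for illustration): 1-D points, true boundary at 0.55.
pool = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def oracle(x):
    return x > 0.55


def train(labeled):
    """The "model" is a threshold halfway between the labeled classes."""
    negatives = [pool[i] for i, y in labeled.items() if not y]
    positives = [pool[i] for i, y in labeled.items() if y]
    return (max(negatives) + min(positives)) / 2


def predict(threshold, xs):
    return [1.0 / (1.0 + np.exp(-10.0 * (x - threshold))) for x in xs]


model, labeled = active_learning_loop(
    pool, {0: False, 10: True}, train, predict, oracle, batch_size=1, rounds=3
)
```

Other sampling strategies can be swapped in by replacing `uncertainty_sample` with any function that ranks the unlabeled predictions.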

Workflow Automation

Active Learning by definition involves periodic human intervention in the process. For the above workflow to function effectively, we needed a single interface where our data team could seamlessly label new data points and immediately study their effect on the model, minimizing the time delay between iterations.

To achieve the above, we developed a single tool with the following features:

Labeling
- Using offline-clustering or user-specified regexes, filter out samples to label (Bootstrapping)
- Using an existing model and a sampling strategy, filter out samples to label (Active Sampling)
Dataset Updates & Training
- Have a single button (call to action) to create an updated labeled dataset from the available set of labels, fire a training run, and run the trained model on an unlabeled dataset to have it ready for Active Sampling
- Periodically pull in fresh unlabeled data from the customer, and run the latest model on it to have it ready for Active Sampling
Model Evaluation
- For a trained model, calculate different accuracy metrics on a given dev/test set
- For a trained model, show misclassifications per class from a given dev/test set
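Two of the features listed above are simple enough to sketch in a few lines: regex-based bootstrapping of samples to label, and per-class misclassification reports on a dev/test set. The function names, patterns, and class names below are hypothetical and only illustrate the idea; they do not describe the internal tool.

```python
import re
from collections import defaultdict


def bootstrap_by_regex(samples, patterns):
    """Keep samples matching any user-specified regex (candidates to label)."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [s for s in samples if any(c.search(s) for c in compiled)]


def accuracy(gold, predicted):
    """Fraction of dev/test samples whose predicted class matches the gold class."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def errors_per_class(texts, gold, predicted):
    """Group misclassified samples by gold class as (text, predicted class)."""
    errors = defaultdict(list)
    for text, g, p in zip(texts, gold, predicted):
        if g != p:
            errors[g].append((text, p))
    return dict(errors)


samples = ["I want to cancel my plan", "hello there", "Can I get a discount?"]

# Bootstrapping: surface likely positives for the classes being labeled.
candidates = bootstrap_by_regex(samples, [r"\bcancel\b", r"\bdiscount\b"])

# Evaluation: overall accuracy plus the misclassifications for each class.
gold = ["cancel", "greeting", "discount"]
predicted = ["cancel", "greeting", "cancel"]
acc = accuracy(gold, predicted)
errs = errors_per_class(samples, gold, predicted)
```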

[Figure: Active Learning workflow automation]

As shown in the above image, our goal was to automate as many steps as possible in the iterative workflow, maximizing the speed of iteration and minimizing the manual human effort required.

The large pre-trained language models powering our classifiers are known to have a vast amount of world knowledge trapped inside them, allowing them to [...] careers page for open positions.

Acknowledgments

Shubham Gupta for continuous contributions to the Active Learning workflow automation.
Tim Shi for overseeing the various projects described in this blog post.
Jessica Zhao, Lars Mennen, Motoki Wu, Saurabh Misra, Shubham Gupta and Tim Shi for edits and reviews on this blog post.

References

Universal Language Model Fine-tuning for Text Classification (Howard and Ruder, 2018)
The Natural Language Decathlon: Multitask Learning as Question Answering (McCann et al., 2018)
Multi-Task Deep Neural Networks for Natural Language Understanding (Liu et al., 2019)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)
In Defense of One-Vs-All Classification (Rifkin and Klautau, 2004)
Active Learning Literature Survey (Settles, 2010)
Language Models are Unsupervised Multitask Learners (Radford et al., 2019)
