After more than a decade of data labeling for companies around the globe, we’ve learned that crowdsourcing can get you access to a large number of workers. It can also create costly issues that delay production, and these problems are particularly burdensome in data labeling for machine learning (ML).
Here are some of the hidden costs of the crowd.
1. Poor data quality
Anonymity is a bug, not a feature, when it comes to crowdsourcing. Workers have little accountability for poor results. When task answers aren’t straightforward and objective, crowdsourcing requires control measures such as double-entry and consensus models, where multiple workers label the same item and the majority answer wins.
If you’re unsatisfied with the work, often you must send it through again, hoping for a different result, which places more of the quality assurance (QA) burden on your internal team. Each time a task is sent to the crowd, costs rack up.
In the past, we used a microtasking platform that distributed a single task to multiple workers, using a consensus model to measure quality. Our client success team found this model costs at least 200% more per task than processes where quality standards can be met on the first pass.
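To make that cost math concrete, here’s a minimal sketch of a majority-vote consensus check. The per-judgment price and 3x redundancy below are illustrative assumptions, not any platform’s actual rates; the point is that tripling judgments per task is exactly what produces a 200% cost increase over a single trusted pass.

```python
from collections import Counter

def consensus_label(judgments, min_agreement=2):
    """Majority vote across redundant judgments; None if no consensus."""
    label, count = Counter(judgments).most_common(1)[0]
    return label if count >= min_agreement else None

# Illustrative numbers only: a $0.05 per-judgment price and 3x redundancy
# are assumptions, not any vendor's actual pricing.
price_per_judgment = 0.05
redundancy = 3

single_pass_cost = price_per_judgment          # one trusted labeler
consensus_cost = price_per_judgment * redundancy

print(consensus_label(["cat", "cat", "dog"]))                            # -> cat
print(f"Cost increase: {(consensus_cost / single_pass_cost - 1):.0%}")   # -> 200%
```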
Similar results emerged when Hivemind, a software company that helps companies create datasets for analysis, conducted a study comparing crowdsourcing with managed teams. They found the quality of managed teams’ work to be up to 25% higher than that of crowdsourced teams on the same tasks.
Bottom line: Managed teams are better suited to tasks requiring high quality because they can handle more nuanced tasks and get them right the first time.
2. Lack of agility with your workforce or tools
In AI development, tasks can change as you train your models, so your labeling workforce and tools must be able to adapt. It helps to have a tight feedback loop with your workforce. As you make changes to improve your process, workers increase their domain expertise and context so they can adapt quickly to changes in the workflow. Your model performance and overall quality will improve faster.
Crowdsourcing limits your agility to modify and evolve your process, and it creates a barrier to worker specialization - the proficiency with your data and process that grows over time. Workers come and go, few overcome the learning curve, and you are likely to see inconsistency in the quality of your data. Any change to your process can create bottlenecks.
You’ll want some agility with your tooling as well. As you learn, you’ll make adjustments to your labeling tools, so be careful about unnecessarily tying your workforce to any one tool. Give yourself the flexibility to swap out the elements of your data production line that aren’t working.
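One way to keep that flexibility is to store labels in a tool-agnostic format and write a thin adapter for each tool. The sketch below shows one possible approach; the vendor export fields (task_ref, answer, worker) are hypothetical, shown only to illustrate the pattern.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Tool-agnostic record: keep this as your source of truth
    so the labeling tool can be swapped without reworking your data."""
    item_id: str
    label: str
    labeler_id: str

def from_tool_a(row: dict) -> Annotation:
    # "task_ref", "answer", and "worker" are a hypothetical vendor
    # export format, used only to illustrate the adapter pattern.
    return Annotation(item_id=row["task_ref"],
                      label=row["answer"],
                      labeler_id=row["worker"])

# Swapping tools means writing one new adapter, not migrating your data.
record = from_tool_a({"task_ref": "img_0042", "answer": "pedestrian", "worker": "w17"})
print(record)
```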
Bottom line: When you use the same labelers, their context and domain expertise - or understanding of your rules and edge cases - increases, resulting in higher quality training data.
3. Management burden
When you crowdsource your data labeling, and even when you do the work in-house, you should plan for worker churn. As new workers join the crowd, you’ll have to rely on the business rules you created and leave new workers to train themselves on how to do the work. If your team is bringing each new worker up to speed, be sure to allocate time for that management responsibility.
With some crowdsourcing options, you are responsible for posting your projects, reviewing and selecting candidate submissions, and managing worker relationships. You’ll need to factor in your costs to attract, train, and manage your group of workers.
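As a rough way to budget for that burden, you can model onboarding time as a function of churn. The numbers below (team size, churn rates, ramp-up hours) are illustrative assumptions, not benchmarks; the sketch only shows how churn drives management cost.

```python
def annual_onboarding_hours(team_size, churn_rate, ramp_up_hours):
    """Hours your team spends onboarding replacement workers per year.
    All inputs are illustrative assumptions, not industry benchmarks."""
    replacements = team_size * churn_rate
    return replacements * ramp_up_hours

# Example: a 50-person crowd with 80% annual churn vs. a managed team
# of 50 with 20% churn, assuming 10 hours of ramp-up per new worker.
print(annual_onboarding_hours(50, 0.80, 10))  # -> 400.0 hours
print(annual_onboarding_hours(50, 0.20, 10))  # -> 100.0 hours
```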
Look into who owns your data as part of your agreement. In addition to platform and transaction fees, some crowdsourcing vendors claim ownership of the data their teams label. That means they’re allowed to use your data to train their own algorithms or sell it to serve their own customers.
Bottom line: The crowd will cost more per task in management responsibility. Watch for technology, onboarding, and training fees. Ask vendors how they use your data.
Crowdsourcing can be a good model when you need the work done right away and you need a lot of people on short notice. In this case, your task should be fixed, simple, measurable, and objective. The rapid data turnaround can be helpful in establishing a process and defining business rules. It can be helpful for prototyping, so you can start small and scale fast.
A managed team is a better choice when quality is important and you want to be able to iterate or evolve the work. Be sure to create a closed feedback loop with your workers that makes it possible to evolve your tasks over time. This is especially important with ML because the process from model to production requires collaboration and strong communication across teams of people who are doing disparate work. To learn more about how to scale your data labeling operation, contact us.