It’s their least favorite part of the job, yet it consumes most of their time. Data wrangling – or gathering and preparing dirty data so it can be used in critical business applications – is the biggest problem in data science today, according to a Kaggle survey. So how can you get clean, structured data you can trust?
First, a recap. As we discussed in the first article of this series, dirty data is data that is invalid or unusable. In the second article, we learned that fewer than half (44%) of people trust their organization’s data to make important business decisions. In fact, more than half of executives (52%) said they rely on educated guesses or gut feelings even when they use their data to make decisions. Guessing is risky and can be costly.
And then there’s dark data, which is hidden and could be valuable to your business. When you don’t evaluate that data for strategic use, there’s an opportunity cost. That’s what United Airlines faced when an estimate showed that hidden dark data within its antiquated system was costing the airline $1 billion, largely from bad assumptions about how much each traveler might pay for a seat.
Unstructured data, which must be modified in some way to make it compatible with the system that will consume it, also can hold tremendous value for your business. If it were cleaned or otherwise enriched, it could be used to build new products, to solve painful problems, and to disrupt entire industries.
Leveraging data for your business strategy is no simple task. It takes discovery, planning, and relentless optimization as you execute. It also takes clean data. Data cleansing – also called data wrangling, data munging, or data scrubbing – is a necessary and significant part of data science and a growing priority for businesses around the globe.
3 Steps to Clean, Structured Data
Most data cleansing requires a combination of automated and manual techniques. Here’s a three-step process to help you build a foundation that supports clean data for your business over the long term:
1. Standardize how you track and record data in your existing systems
This is where most businesses fall short. That should come as no surprise – in startups and enterprises alike, managers know data is being captured but don’t always have visibility into data that lives in other departments. Often, the most useful data is siloed across disparate elements of a company’s tech stack. If you don’t have an easy way to pull up a dashboard or report, you don’t have the intelligence you need to design and iterate on smart strategies.
Start by establishing consensus internally on how data will be gathered, managed, and used by and for the business. Expect that process alone to take several months. Work in parallel to validate your system of record and document decisions made. Anticipate any regulations that could present challenges for your business’ use of data, such as GDPR, and engage legal experts to develop the standards your business will use to comply. Document and educate employees, as appropriate, on your data governance and the importance of clean data to the business.
We learned in article two that an IoT system must be designed around four key factors to produce high-quality data over time. According to James Branigan, IoT software platform developer and founder of Bright Wolf, those factors are:
- Trust that the right devices are communicating with the correct end system,
- Identity to associate incoming data with the correct time-series history and address messages to the correct device,
- An accurate time stamp for each event and data point, and
- Chain of custody to understand the complete history of each data point – including details about the device and the software that processed the data.
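To make those four factors concrete, here is a minimal sketch of what a data-point record built around them might look like. The field names, the device allow-list, and the custody format are all illustrative assumptions, not part of Branigan's design:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical allow-list of registered device IDs (the "trust" factor).
TRUSTED_DEVICES = {"sensor-001", "sensor-002"}

@dataclass
class DataPoint:
    device_id: str       # identity: ties the reading to its time-series history
    value: float
    timestamp: datetime  # an accurate time stamp for each event and data point
    # chain of custody: each device or processing step appends an entry here
    custody: list = field(default_factory=list)

    def is_trusted(self) -> bool:
        """Trust: only accept data from devices registered with the end system."""
        return self.device_id in TRUSTED_DEVICES

reading = DataPoint("sensor-001", 21.5, datetime.now(timezone.utc))
reading.custody.append("ingested-by:gateway-v2")  # record who touched the data
print(reading.is_trusted())  # True for a registered device
```

The point of a structure like this is that every downstream consumer can answer the same questions: which device produced this value, when, and which software has handled it since.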
2. Add data and integrate systems wisely – and with caution
Unfortunately, no one system handles all of the data businesses rely on to operate profitably every day. Each layer of your tech stack – and the processes people use to interact with them – must be examined. Consider the factors that are most important to your process (e.g., speed, accuracy, volume of data) and align your stack accordingly.
Integration will be critical in this process. As you add functionality to your system of record, consider carefully how each platform integrates with your tech stack. Think about how each one stores and manages data. If you can, clean data as you import it into new platforms.
As we learned in article two of this series, you can address some quality issues as you join data. A developer can use scripts and coding tools to merge data consistently and accurately for two or more relatively small data sources. You still may find you need to remove duplicates, adjust case and date/time formats, and regionalize spelling (e.g., British English vs. American English).
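As a rough sketch of that kind of merge script, the example below joins two small record sets while de-duplicating, normalizing case and date formats, and regionalizing spelling. The field names, date formats, and spelling map are illustrative assumptions:

```python
from datetime import datetime

# Hypothetical British-to-American spelling map for regionalization.
SPELLING_MAP = {"colour": "color", "organisation": "organization"}

def normalize(record):
    """Clean one record: fix case, canonicalize the date, regionalize spelling."""
    name = record["name"].strip().title()
    # Accept either ISO or US-style dates and emit one canonical format.
    raw = record["signup_date"]
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            date = datetime.strptime(raw, fmt).date().isoformat()
            break
        except ValueError:
            continue
    else:
        date = None  # flag unparseable dates for manual review
    notes = " ".join(SPELLING_MAP.get(w, w)
                     for w in record.get("notes", "").lower().split())
    return {"name": name, "signup_date": date, "notes": notes}

def merge(*sources):
    """Merge several record sets, dropping duplicates after cleaning."""
    seen, merged = set(), []
    for source in sources:
        for record in source:
            clean = normalize(record)
            key = (clean["name"], clean["signup_date"])
            if key not in seen:  # remove duplicates across sources
                seen.add(key)
                merged.append(clean)
    return merged

crm = [{"name": "ada lovelace", "signup_date": "2017-03-01",
        "notes": "Prefers COLOUR print"}]
billing = [{"name": "Ada Lovelace", "signup_date": "03/01/2017",
            "notes": "prefers colour print"}]
print(merge(crm, billing))  # one record: the cross-system duplicate collapses
```

Notice that the two source records only become recognizable as duplicates after cleaning, which is why cleaning as you import pays off.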
Be sure to establish how much you plan to iterate on your process or the way you manage data over time, as that may dictate what is available to you down the road. For example, if you want more control over how data is consumed or reported, you may want to consider an open source solution that gives you the power to create or adjust particular features, such as accuracy thresholds for data work.
3. Plan for your augmented workforce
At the outset, people will handle whatever your automation cannot handle. That includes exceptions, technology gaps, and quality control. You’ll need confidence in your team, and they’ll need the right technology and training to produce quality results using a process you trust. If some of your critical data processes are repetitive and routine, you may need to source a workforce to gather, process, or enrich data at high volumes.
Deloitte predicted a burgeoning need for the augmented workforce in its 2017 Global Human Capital Trends Report. According to the report, as connectivity and cognitive technology accelerate and change the nature of work, “organizations must reconsider how they design jobs, organize work, and plan for future growth.”
It’s reasonable to expect the structure of your organization to look quite different in 2020 than it does today. It could include teams of cloud workers who manage data. These teams will specialize in functional areas your core team doesn’t cover – and where there would be no strategic benefit to learning them in-house.
Many companies outsource data gathering, cleaning, and enrichment, often for one of two reasons: 1) so data science and engineering teams can redirect their focus to strategy, iteration, and quality; and 2) so operations teams can achieve greater efficiency, quality, or cost with outsourced teams.
When it comes to cleaning data, think carefully about how best to source and deploy every facet of your workforce. Crowdsourcing is a good option for short-term projects but can be inherently inefficient over the long term, as it requires multiple people to complete the same task so their answers can be compared for double or triple consensus. Managed cloud labor can be helpful when your process is clear, you need agility to iterate quickly, and quality is crucial.
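The double- or triple-consensus pattern can be sketched in a few lines: several workers label the same item, and an answer is accepted only when enough of them agree. The labels and threshold below are illustrative assumptions:

```python
from collections import Counter

def consensus(answers, required=2):
    """Return the agreed answer, or None if no label reaches the threshold."""
    label, count = Counter(answers).most_common(1)[0]
    return label if count >= required else None

print(consensus(["cat", "cat", "dog"]))   # "cat" — two of three workers agree
print(consensus(["cat", "dog", "bird"]))  # None — no consensus; item needs rework
```

This also makes the inefficiency the article mentions visible: every item costs two or three workers' time, and items with no consensus must be routed back for yet another pass.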
Wrapping It All Up
Data is a gold mine for today’s businesses, and it will be even more valuable for those that are strategic about using it. But it comes with a catch: data must be clean, structured, and visible to unlock its business value.
Companies that have developed a standard of governance for data collection, storage, and enrichment are ahead of the game. Those that have shored up their tech-and-human stack to maintain data quality over time are doing even better. And if you’re planning for your augmented workforce, you’re well on your way. Consider it your goal for the year – as it may take you that long – to establish or optimize your process in even one of these areas, and you’ll be ahead of most of your peers.
No matter how you clean your data, you’re not alone. Others like you, along with countless data scientists, are pondering these same issues and finding new ways to clean decades-old data while they apply the latest technology to streamline data collection, cleaning, and enrichment processes. Take heart; technology is bound to make the process easier. And, the lessons you learn along the way will inform your strategy and many of your processes. So get started – the data won’t clean itself! At least, not yet.
This article originally appeared in IoT for All on January 25, 2018. It is the third and final in a series of three articles about dirty data.