Performant machine learning models require high-quality data. And training your machine learning model is not a single, finite stage in your process. Even after you deploy it in a production environment, it’s likely you will need a steady stream of new training data to ensure your model’s predictive accuracy over time.
After all, training data explicitly labels attributes in a dataset that represent ground truth in the outside world, and that world is continually changing. Without periodic retraining, a model’s accuracy will naturally degrade over time as real-world variables evolve.
In this article, we’ll discuss why it’s essential to continue to retrain your machine learning models, no matter how rigorous your initial training data process might be. We’ll also discuss approaches for retraining and the advantages of each.
Finally, we’ll cover how you can anticipate the need for subsequent updates at the beginning of any machine learning project. By building in retraining processes from the start, you’ll be designing a sustainable predictive model.
Data Drift and the Need for Retraining
Why do most machine learning models need to be updated to remain accurate? The answer lies in the nature of training data and the way it informs machine learning models’ predictive functions.
Training data is a static dataset from which machine learning models extrapolate patterns and relationships to form predictions about the future.
As real-world conditions change, training data may be less accurate in its representation of ground truth. Imagine a machine learning model used to predict rental costs in 50 large metro areas. Training data from 2000 to 2019 might predict rental prices for 2020 with impressive accuracy. It would probably be less effective in predicting rental prices for 2050 because the fundamental nature of the housing market is likely to change in the decades to come.
Applying natural language processing (NLP) to train a chatbot provides another useful illustration of data drift. The way we use language is continually evolving, so semantic analysis of the training data that powers a chatbot must be updated to reflect current usage. Imagine trying to use training data from the 1980s to train a chatbot to interact with modern consumers. Over 40 years, language changes substantially, making updated training data a necessity.
This phenomenon has been described in several ways, including data drift, concept drift, and model decay. Whatever you call it, it represents a hard truth of machine learning: At some point in the future, your training data will no longer provide a foundation for accurate prediction.
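To make drift concrete, here’s a minimal sketch of one common detection technique: comparing the distribution of a feature in production against its distribution in the original training set with a two-sample Kolmogorov-Smirnov test. The rent figures, sample sizes, and significance threshold below are hypothetical, and real pipelines typically run checks like this across many features.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(train_feature, live_feature, alpha=0.05):
    """Flag drift when a feature's live distribution differs
    significantly from its training distribution (two-sample KS test)."""
    statistic, p_value = ks_2samp(train_feature, live_feature)
    return p_value < alpha, statistic

# Hypothetical example: monthly rents seen during training vs. in production
rng = np.random.default_rng(42)
train_rents = rng.normal(loc=1500, scale=300, size=5000)  # historical data
live_rents = rng.normal(loc=1800, scale=350, size=1000)   # recent data

drifted, stat = detect_drift(train_rents, live_rents)
print(f"drift detected: {drifted} (KS statistic = {stat:.3f})")
```

A significant result doesn’t automatically mean the model is wrong, but it’s a strong signal that retraining on fresher data is worth investigating.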
The answer to this inevitable challenge is retraining your model with new or expanded data on a regular basis. Indeed, training your model is an ongoing process, especially if quality is important.
How should you approach updating your machine learning model? In simple terms, you have two options: manually retraining your model using updated inputs, or building a model designed to learn continuously from new data.
The Manual Approach to Model Retraining
The manual approach to updating a machine learning model is, essentially, to duplicate your initial training data process with a newer set of data inputs. In this case, you decide how and when to feed the algorithm new data.
The viability of this option depends on your ability to obtain and prepare new training data on a regular basis. You can monitor your model’s performance over time, determining when an update is necessary. If your model’s accuracy is noticeably degrading, retraining with updated data may be in order.
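As a rough illustration of that monitoring loop, the sketch below scores a deployed model on recently labeled data and retrains from scratch when accuracy drops past a tolerance. The baseline score, threshold, and model choice are all hypothetical placeholders for whatever your own pipeline tracks.

```python
from sklearn.ensemble import RandomForestRegressor

BASELINE_R2 = 0.85        # hypothetical score recorded at deployment
RETRAIN_TOLERANCE = 0.05  # retrain if the score drops more than this

def maybe_retrain(model, X_recent, y_recent, X_all, y_all):
    """Score the deployed model on recent labeled data; if accuracy has
    degraded past tolerance, refit from scratch on the updated dataset."""
    current_r2 = model.score(X_recent, y_recent)
    if BASELINE_R2 - current_r2 > RETRAIN_TOLERANCE:
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X_all, y_all)  # manual retrain on old + new data combined
    return model
```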
One advantage of this approach is that tinkering can often lead to insights and innovation. If you monitor your model closely and identify shortcomings, you might discover the value of including additional data or revising your algorithm in more fundamental ways.
The Continual Learning Approach to Model Retraining
Continual learning models incorporate new streams of data, often from the production environment in which they have been deployed.
Consumers engage daily with machine learning models that use continual learning. Consider the music streaming platform Spotify, which uses collaborative filtering to recommend music based on the preferences of users with similar tastes, creating value and competitive advantage.
As Spotify users listen to music, data pertaining to their choices is fed back into the company’s predictive algorithms. The resulting feedback loop refines the recommendations the app offers its users and permits high-level personalization, such as machine-generated, personalized playlists. Other leading consumer media services, such as Netflix, use similar continual learning systems.
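Spotify’s production systems are far more sophisticated than anything shown here, but scikit-learn’s partial_fit interface illustrates the basic mechanic of continual learning: a model that folds in new batches of data without being rebuilt from scratch. The synthetic data below stands in for whatever stream your application generates.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])  # partial_fit requires all classes up front

# Initial training on historical data
model = SGDClassifier(random_state=0)
X_hist = rng.normal(size=(1000, 5))
y_hist = (X_hist[:, 0] > 0).astype(int)
model.partial_fit(X_hist, y_hist, classes=classes)

# As new interaction data streams in, fold each batch into the model
for day in range(10):  # e.g., one batch of fresh data per day
    X_new = rng.normal(size=(100, 5))
    y_new = (X_new[:, 0] > 0).astype(int)
    model.partial_fit(X_new, y_new)
```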
As you might expect, the technical expertise and resources necessary to build these systems are simply out of reach for many organizations. Moreover, you’ll need a steady stream of data ready-made for automatic integration. In a continual learning model, human intervention is possible, but it represents a real bottleneck. Spotify, for example, does not need the data generated by its millions of users to be cleaned or formatted by people before being fed back into its algorithm.
Whether manual updates or continual learning seems like the more effective (and feasible) option, you’ll need to think strategically about the workforce and technology you’ll use to produce new data for retraining. If you plan to use your model for the foreseeable future, you’ll need the right resources in place to keep it up to date.
Anticipating Evolution: Choosing Your Team
Creating training data requires a strategic combination of people, process, and tools. To navigate the ambiguities of gathering, cleaning, and labeling data, you’ll need an efficient tech-and-human stack that includes skilled people and advanced technology.
Many organizations can’t manage or scale in-house teams to prepare training data, so they seek alternative ways to harness human intelligence. Crowdsourced labor is a common choice, allowing you to tap hundreds of anonymous workers on short notice.
Yet anonymous crowdsourcing carries hidden costs, including poor communication with workers, which can result in low-quality work. And if these drawbacks are apparent as you develop your initial training dataset, they’ll be especially frustrating as you seek to retrain and update your model moving forward.
With an anonymous group of crowdsourced workers, it’s nearly impossible to exercise oversight or transfer institutional memory. Every time you develop new training data, you run the risk of uncovering new inconsistencies and performance issues.
CloudFactory offers another option: a managed team of CloudWorkers who are ready to transform your data operations. You can engage our workforce of skilled professionals for your specific data needs, scaling up or down as necessary over time. You’ll get the service and communication of a real team with the flexibility of crowdsourced labor, keeping down costs without sacrificing efficiency.
If you aim to keep your machine learning models performant over the long term, you’ll need a workforce flexible enough to support your ongoing training data needs. Take a look at our scalable approach to machine learning, and see how we’ve helped other companies conquer their data challenges, power innovative products, and disrupt their industries.