Finputer - Data science and artificial intelligence: How to version control your production Machine Learning models

Machine Learning is about rapid experimentation and iteration, and without keeping track of your modeling history you won’t be able to learn much. Versioning lets you keep track of all of your models, how well they’ve done, and what hyperparameters you used to get there. This post will walk through why versioning is important, the tools available for it, and how to version the models that go into production.

The Importance of Model Versioning

If you’ve spent time working with Machine Learning, one thing is clear: it’s an iterative process. There are so many different parts of your model – how you use your data, hyperparameters, parameters, algorithm choice, architecture – and the optimal combination of all of those is the holy grail of Machine Learning. But while there is some method to the madness, much of finding the right balance is trial and error. Even the best Machine Learning Engineers working on the most complex Deep Learning projects still need to tinker to get their models right.

With that in mind, here are some of the reasons why versioning is so important to Machine Learning projects:

1. Finding the best model

Throughout that iterative process of updating and tinkering with the different parts of your model, your accuracy on your dataset will vary accordingly. In order to keep track of the best models you’ve created and the associated tradeoffs, you need a versioning system.
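As a rough illustration of what "keeping track" can mean, here is a minimal sketch of an experiment log that records each run's hyperparameters and accuracy and then picks the best one. The class names, version labels, and numbers are all made up for the example; a real setup would persist this somewhere durable rather than keep it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    """One training run: what was tried and how well it did."""
    version: str
    hyperparameters: dict
    accuracy: float


@dataclass
class ExperimentLog:
    """In-memory history of runs (hypothetical; a real system would persist this)."""
    runs: list = field(default_factory=list)

    def record(self, version, hyperparameters, accuracy):
        run = ModelVersion(version, dict(hyperparameters), accuracy)
        self.runs.append(run)
        return run

    def best(self):
        # Highest accuracy wins; on a tie, max() keeps the earliest recorded run.
        return max(self.runs, key=lambda run: run.accuracy)


# Illustrative values only, not results from a real model.
log = ExperimentLog()
log.record("v1", {"learning_rate": 0.1, "depth": 3}, accuracy=0.81)
log.record("v2", {"learning_rate": 0.01, "depth": 5}, accuracy=0.88)
log.record("v3", {"learning_rate": 0.01, "depth": 8}, accuracy=0.86)

best_run = log.best()
print(best_run.version, best_run.hyperparameters)
```

Because every run keeps its hyperparameters next to its score, you can also see the tradeoffs between versions, not just which one won.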

2. Failure tolerance

When pushing models into production, they can fail for any number of reasons. You want to update your models to take new data into account or incorporate speed improvements, but it’s tough to be sure how they’ll perform in real time. If you do encounter an issue with a production model, you need to be able to revert quickly to the previous working version.
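To make the "revert quickly" idea concrete, here is a small sketch of a registry that remembers the order in which versions went live, so rolling back is a single call. The `ModelRegistry` class and the version strings are hypothetical, and the sketch only tracks labels; actually swapping the served model is left out.

```python
class ModelRegistry:
    """Tracks which model version is live and the order versions went live."""

    def __init__(self):
        self._history = []  # versions in the order they were promoted

    def promote(self, version):
        """Make a new version the live one."""
        self._history.append(version)

    @property
    def live(self):
        return self._history[-1] if self._history else None

    def rollback(self):
        """Drop the live version and return to the previous one."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        failed = self._history.pop()
        return failed, self.live


registry = ModelRegistry()
registry.promote("1.0.0")
registry.promote("1.1.0")  # suppose this one misbehaves in production

failed, restored = registry.rollback()
print(f"rolled back {failed}, now serving {restored}")
```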

3. Increased complexity and file dependencies

With traditional software versioning, there are only a couple of types of files to keep track of: your code and your dependencies. With Machine Learning though, things are a bit more complex. First and foremost, you have datasets (typically not part of a normal software deployment). You need to keep track of what data you train and test on, and whether that changes over time.

Additionally, saving your models in most of the popular Deep Learning frameworks results in a file that you need to keep track of. Finally, models are often written in different languages and rely on multiple frameworks, which makes dependency tracking even more important.
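One way to picture tracking all three pieces together is a small manifest that fingerprints the dataset and the saved model file and lists the dependencies used. The sketch below assumes SHA-256 hashes as fingerprints; the file names, the manifest layout, and the dependency pin are placeholders invented for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(version, data_path, model_path, dependencies):
    """Describe one model version: its data, its weights, its dependencies."""
    return {
        "version": version,
        "data_sha256": file_sha256(data_path),
        "model_sha256": file_sha256(model_path),
        "dependencies": dependencies,
    }


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in files; real ones would be your dataset and saved model.
    data_file = Path(workdir) / "train.csv"
    model_file = Path(workdir) / "model.bin"
    data_file.write_bytes(b"x,y\n1,2\n")
    model_file.write_bytes(b"placeholder weights")

    manifest = build_manifest(
        "v1", data_file, model_file, {"example-framework": "1.0.0"}
    )

    # Changing the training data changes its fingerprint.
    data_file.write_bytes(b"x,y\n1,2\n3,4\n")
    changed = file_sha256(data_file) != manifest["data_sha256"]

print(json.dumps(manifest, indent=2))
print("data changed since v1:", changed)
```

A manifest like this is small enough to commit alongside your code, even when the data and model files themselves are not.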

4. Gradual, staged deployment

If and when you make significant updates to your production models, those changes are rarely deployed immediately and in one shot. To ensure failure tolerance and test appropriately, new models are typically rolled out gradually until teams can be sure that they’re working properly. Versioning gives you the tools to deploy the right versions at the right times.
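A gradual rollout can be sketched as a routing function that sends a configurable share of requests to the new version. The approach below, hashing a request ID into one of 100 buckets, is just one common way to do this and is an assumption of the example; the function name, IDs, and version strings are hypothetical.

```python
import hashlib


def pick_version(request_id, stable, candidate, candidate_percent):
    """Route a fixed share of requests to the candidate model.

    Hashing the request ID keeps routing deterministic: the same ID always
    lands in the same bucket, so raising the percentage only ever moves
    requests from the stable version to the candidate, never back.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return candidate if bucket < candidate_percent else stable


request_ids = [f"user-{n}" for n in range(1000)]
for percent in (0, 10, 50, 100):
    hits = sum(
        pick_version(rid, "1.0.0", "1.1.0", percent) == "1.1.0"
        for rid in request_ids
    )
    print(f"{percent}% target: {hits} of {len(request_ids)} requests hit 1.1.0")
```

Setting the percentage back to 0 sends all traffic to the stable version again, which ties staged deployment back to failure tolerance.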

Versioning Tools to Get The Job Done

It’s hard to overstate how nascent the field of production Machine Learning is, and that means the tools supporting this ecosystem are only starting to be fully developed. Here are some of the solutions that practitioners are currently using, and some new entrants too.

1. Git

Ah, ol’ reliable. Git is the version control system used across the board to track software development and deployment. You might be familiar with GitHub or Bitbucket, which are web-based hosting services built around this open-source tool. Git tracks any changes made to your code and gives you a ton of functionality around implementing, storing, and merging those changes. Pretty much everyone uses it in one way or another.

Source: xkcd

But alas, Git is not without its issues. In addition to the often perplexing experience of actually using it, it’s missing a lot of the functionality that you need for Machine Learning (because it wasn’t created for Machine Learning!). Git on its own isn’t well suited to tracking large datasets, changes to binary model files, and model dependencies. There are extensions that can help, but those solutions are tough to implement and rarely complete.

2. Sandbox environments

Data Scientists often rave about Jupyter Notebooks, a sandbox-type environment that lets you run code in cells and insert Markdown in between (or at least I rave about them). Working in a Jupyter Notebook is like writing a book with code in it: you can be detailed about what each cell does, and organize things in a visually pleasing way. Separating code into cells and sections is a viable way to version your different models.

When it comes to deployment and production though, versioning your models in a notebook doesn’t really cut it. Jupyter Notebooks are a tool for exploration and visualization, not for managing dependencies and tracking minute changes to hyperparameters.

3. Data Version Control (DVC)

Data Version Control (DVC) is a tool that works alongside Git and adds functionality for managing your code and data together. It works directly with cloud storage (such as Amazon S3 or Google Cloud Storage) to push your changes. According to their tutorial, “DVC streamlines large data files and binary models into a single Git environment and this approach will not require storing binary files in your Git repository.” In short, it combines Git with Machine Learning specific functionality.
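The general idea of keeping binary files out of the repository can be sketched as follows: store the large file in a separate cache keyed by its content hash, and keep only a tiny pointer file next to your code. To be clear, this is an illustration of the concept, not DVC's actual implementation or file format; the function names and the `.pointer.json` suffix are invented for the example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_path, cache_dir):
    """Copy a file into a content-addressed cache and write a pointer next to it."""
    data_path, cache_dir = Path(data_path), Path(cache_dir)
    # Reads the whole file at once for brevity; chunk the read for large files.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_path, cache_dir / digest)
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_path.name, "sha256": digest}))
    return pointer


def restore_from_cache(pointer_path, cache_dir):
    """Recreate the data file that a pointer refers to."""
    pointer_path = Path(pointer_path)
    pointer = json.loads(pointer_path.read_text())
    target = pointer_path.with_name(pointer["path"])
    shutil.copyfile(Path(cache_dir) / pointer["sha256"], target)
    return target


with tempfile.TemporaryDirectory() as workdir:
    repo, cache = Path(workdir) / "repo", Path(workdir) / "cache"
    repo.mkdir()
    weights = repo / "model.bin"
    weights.write_bytes(b"large binary weights")

    pointer_file = add_to_cache(weights, cache)
    weights.unlink()  # simulate a fresh checkout that only has the pointer

    restored = restore_from_cache(pointer_file, cache)
    restored_bytes = restored.read_bytes()
    pointer_text = pointer_file.read_text()

print(pointer_text)
```

Only the pointer would be committed to Git; the cache directory stands in for remote storage.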

For a tutorial on how to implement DVC in your project and why it’s so helpful, check out this walkthrough.

4. Commercial solutions

The traditional business wisdom tells us that if there’s a problem, there’s a business. There are a few companies starting out attempting to solve the versioning problem. Comet.ml is an automatic versioning solution that tracks and organizes all of your team’s modeling efforts. You can easily compare experiments, see the differences in code between two models, and invite team members to collaborate on a project.

5. Platforms as a Service and Algorithmia

Even once you’ve found a way to manage versioning during your training and experimentation process, much of the complexity resides in inference: deploying the right models in the right places at the right times. If you’re using a Platform as a Service to deploy your Machine Learning models, it might offer some functionality around versioning.

If you’re deployed on the Algorithmia platform, we productionize your models as independent microservices with individual endpoints. That means you can continue to reference historical versions of your models in production without having to worry about them breaking or getting deprecated. It’s as simple as appending a version number to the model name in your API call.
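As a sketch of what pinning a version by name might look like, the helper below builds a model reference with an optional version suffix. The `owner/name/version` layout, the MAJOR.MINOR.PATCH check, and every name in the example are assumptions made for illustration; check the platform's own documentation for the exact reference format.

```python
def versioned_reference(owner, model_name, version=None):
    """Build a model reference, optionally pinned to a specific version.

    Hypothetical helper: the 'owner/name/version' layout is an assumption
    for this example, not a confirmed description of any platform's API.
    """
    reference = f"{owner}/{model_name}"
    if version is not None:
        parts = version.split(".")
        if len(parts) != 3 or not all(part.isdigit() for part in parts):
            raise ValueError(f"expected MAJOR.MINOR.PATCH, got {version!r}")
        reference = f"{reference}/{version}"
    return reference


pinned = versioned_reference("demo", "sentiment", "1.2.0")
unpinned = versioned_reference("demo", "sentiment")
print(pinned)
print(unpinned)
```

Callers that pin a version keep getting the same model even after newer versions are published, while unpinned callers can follow whatever the platform treats as current.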

Monday, August 26, 2019
