Monday, August 26, 2019

Versioning Data Science

Data science, being a relatively new field, has no go-to version control standard. In this post I discuss the data science cycle, the versioning strategy it calls for, and some possible solutions.

Data Science Cycles

Data science is different from traditional software development, especially in stages such as exploratory data analysis, feature engineering, and machine learning model training & validation. While both data science and software development involve writing code, data science tends to be more iterative and cyclical: a cycle often starts with some initial understanding of the data (and hence questions), moves to collecting, exploring, cleaning, and transforming the data, and ends with building, validating, and deploying machine learning models, which in turn leads to better data understanding and the start of the next cycle.
[Figure: the data science cycle. Image credit: DataScience.LA]
Data science is also more interactive, and we can see this from the tools data scientists use: typically not just an IDE, but notebooks (e.g., Jupyter, Zeppelin, or Beaker), which are by nature more interactive and enable shorter code-result cycles.
This calls for a different version control paradigm, because simply copying the software development VCS (version control system) approach to data science won’t work (anyone who has tried to naively git an .ipynb file knows how painful and inflexible it can be). Data science needs its own versioning system so data scientists & engineers can better collaborate, test, share, and reuse.
Currently, although we are seeing some progress, data science version control is still relatively immature and has a lot of room for improvement. In the rest of this blog post I’ll discuss data science’s unique versioning strategy.

Versioning Strategy

Reproducibility

The guiding principle of doing data science is that people need to be able to easily see and replicate each other’s work. Cookiecutter Data Science has a great comment on this:
A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging in to extensive documentation. It also means that they don’t necessarily have to read 100% of the code before knowing where to look for very specific things.
Reproducibility entails keeping track of how every result was produced and, for analyses that involve randomness, noting the underlying random seeds. All custom scripts need to be version controlled.
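For analyses with randomness, a minimal sketch of that seed bookkeeping might look like this (the seed value and metadata file name are illustrative, not prescribed):
import json
import random

import numpy as np

SEED = 42  # illustrative seed value

random.seed(SEED)
np.random.seed(SEED)

# record the seed alongside the results so others can replicate the run
with open('run_metadata.json', 'w') as f:
  json.dump({'seed': SEED}, f)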
One crucial thing to realize (yet sometimes overlooked) is that to ensure data science reproducibility, it’s not just code that we need to version control (for code, versioning tools like GitHub, Bitbucket, and GitLab should suffice). There are two additional dimensions that need version control as well: data and models.

Data versioning

Compared with code, data is usually harder to version control for one simple reason: size. One should feel lucky if the data is only a couple of GB; in this big data age it’s not uncommon for data to be TB or even PB in size.
One good practice is to treat data (files and objects, as opposed to data that sits in a SQL database) as immutable: don’t ever naively edit or overwrite your raw data, and keep only one version of the raw data.
Sometimes we might want to dump intermediate data to disk, if it results from heavy-processing, time-consuming scripts. Be sure to record the generating script and its version. Also, try not to change the data path once it sits in the system, otherwise all scripts/notebooks pointing to the data will break.
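One way to record the generating script and its version is a small sidecar metadata file written next to each intermediate dump; here is a rough sketch (the paths, field names, and use of a git commit hash are assumptions, not from this post):
import json
import subprocess
from datetime import datetime

import pandas as pd

def dump_intermediate(df, path, script_name):
  df.to_parquet(path)  # or to_csv, depending on your stack
  commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
  meta = {
    'generated_by': script_name,
    'git_commit': commit,
    'created_at': datetime.utcnow().isoformat(),
    'rows': len(df),
  }
  with open(path + '.meta.json', 'w') as f:
    json.dump(meta, f, indent=2)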
There are both open-source and commercial solutions for data version control. For example, cloud vendors such as Amazon and Google provide versioning options in their S3 and Cloud Storage services so that data scientists can easily recover from unintended actions and application failures. Also, Git Large File Storage is an open source Git extension for versioning large files, and can be set up on a local server. One thing to keep in mind, though, is that changing a raw file and saving an updated version can easily use up your disk in version control systems (if you have a 5 GB file, even changing 1 byte will cost you another 5 GB).
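For the S3 route, enabling bucket versioning is a one-time call; a minimal boto3 sketch (the bucket name is a placeholder, and credentials/region come from your AWS configuration):
import boto3

s3 = boto3.client('s3')
s3.put_bucket_versioning(
  Bucket='my-data-science-bucket',
  VersioningConfiguration={'Status': 'Enabled'},
)

# every stored version of an object can later be listed (and restored)
versions = s3.list_object_versions(Bucket='my-data-science-bucket', Prefix='raw/')
for v in versions.get('Versions', []):
  print(v['Key'], v['VersionId'], v['LastModified'])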

Models versioning

Models should be treated as immutable as well. Models can end up with exponentially many versions as a result of feature engineering, parameter tuning, new data coming in, etc., and therefore need extra care in version control. Typically, it’s a good idea to follow these rules (a small sketch of the naming and score-mapping pieces follows this list):
  • Specify a name, version, and the training script of the model. Having a uniform naming convention helps, e.g., (datetime)-(model name)-(model version)-(training script id)
  • Make sure file names are unique (to avoid accidental overwriting)
  • Have a global JSON file to store the model-name-to-score mapping (e.g., key being the model name and value being a 3-element tuple storing training/validation/test set scores)
  • Have a script to serve & rotate models according to a policy (e.g., test set score), and clean up legacy models (those in the bottom x% according to the policy, or those that haven’t been served for x months/years)
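Here is a minimal sketch of the naming helper and the global score-mapping file described above (the helper names, registry file name, and timestamp format are illustrative assumptions):
import json
from datetime import datetime

def model_name(name, version, script_id):
  # (datetime)-(model name)-(model version)-(training script id)
  stamp = datetime.utcnow().strftime('%Y%m%d%H%M%S')
  return f'{stamp}-{name}-{version}-{script_id}'

def record_scores(model_file, train, valid, test, registry='model_scores.json'):
  try:
    with open(registry) as f:
      scores = json.load(f)
  except FileNotFoundError:
    scores = {}
  scores[model_file] = [train, valid, test]  # training/validation/test set scores
  with open(registry, 'w') as f:
    json.dump(scores, f, indent=2)

fname = model_name('gbm', 'v3', 'train_gbm')
record_scores(fname, 0.91, 0.88, 0.87)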

Notebook versioning

Notebooks are an integral part of a data scientist’s job, and we need them to share results (especially plots and findings from exploratory phases).
[Figure: a Jupyter notebook. Image credit: Jupyter]
Their JSON format, however, is a pain for version control. In practice, there are several ways to deal with this:
Removing notebook output. This can be done either manually, by clearing all output cells, or programmatically, with a library such as nbstripout.
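The programmatic route amounts to clearing the outputs and execution counts of every code cell, which is roughly what nbstripout automates; a small sketch using nbformat (the notebook file name is just an example):
import nbformat

path = 'johndoe-eda-2017-02-30.ipynb'
nb = nbformat.read(path, as_version=4)

for cell in nb.cells:
  if cell.cell_type == 'code':
    cell['outputs'] = []
    cell['execution_count'] = None

nbformat.write(nb, path)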
Refactor the good parts. Here the DRY (don’t repeat yourself) principle from software engineering also applies to data science: a data scientist shouldn’t write code to do the same task in multiple notebooks. Say we have some code to read in a .csv file and encode all the non-numeric columns. In a notebook we would see something like this:
import pandas as pd

df = pd.read_csv('path/to/data')

### code block ###
# integer-encode every non-numeric (object dtype) column
for c in df:
  if df[c].dtype == 'object':
    df[c] = df[c].factorize()[0]
##################
Standard transformations like this will certainly be reused, so instead of copying/pasting the code block into another notebook, we might want to create a utils.py file, refactor the script into a function, and import the function in notebooks the next time we need it:
# utils.py
def encode_object_col(df):
  # the code block from above, refactored into a function
  for c in df:
    if df[c].dtype == 'object':
      df[c] = df[c].factorize()[0]
  return df

# johndoe-eda-2017-02-30.ipynb
import pandas as pd
from utils import encode_object_col

df = pd.read_csv('path/to/data')
df = encode_object_col(df)
Better still, as we add more processing steps, such as removing columns with a high percentage of missing values, seeding and splitting the data, etc., we can have another pipeline-like function that calls all those refactored functions, parameterizing each step as a flag or threshold and chaining them together:
# utils.py
def make_data_for_ml(*params):
  ... # read in data and process step by step
# johndoe-eda-2017-02-30.ipynb
from utils import make_data_for_ml
X_train, X_test, y_train, y_test = make_data_for_ml(*params)
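To make that concrete, here is one possible shape of such a pipeline function; the steps, parameter names, and default thresholds are illustrative assumptions rather than a prescription:
# utils.py
import pandas as pd
from sklearn.model_selection import train_test_split

def make_data_for_ml(path, target, max_missing=0.5, test_size=0.2, seed=42):
  df = pd.read_csv(path)
  # drop columns with more than `max_missing` fraction of missing values
  df = df.loc[:, df.isnull().mean() <= max_missing]
  # integer-encode non-numeric columns (the encode_object_col step from above)
  for c in df:
    if df[c].dtype == 'object':
      df[c] = df[c].factorize()[0]
  X, y = df.drop(columns=[target]), df[target]
  # seeded split so the result is reproducible
  return train_test_split(X, y, test_size=test_size, random_state=seed)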
This seems obvious, but sometimes people (including me) are just too lazy to do it (refactoring requires time and effort too…). In the long run that will only do you harm, making data science projects messy and harder to manage. So refactor whenever you can, and bear in mind that the principle is to keep the code in notebook files as high level as possible, while leveraging the VCS to version the comparatively lower-level code in scripts.
The last thing I want to add is that naming conventions are crucial for notebook files, since they are usually exploratory scripts we might revisit multiple times (and thus need to search for). A common naming pattern is (date-author-topic.ipynb), so we can search by:
  • date (ls 2017-01-*.ipynb)
  • author (ls 2017*-johndoe-*.ipynb)
  • topic (ls *-*-feature_creation.ipynb)
All the fields, of course, need to follow a common standard (e.g., pick one of 2017-01-30 or 01-30-2017, and one of JohnDoe, JDoe, or jd, and stick to it).

Other good versioning practices

Set up branches. Borrowed from software engineering, branching can be used in data science in multi-user and multi-phase situations, where a typical data science phase (ingestion, EDA, feature engineering, etc.) can span multiple branches.
Use virtual environments. Installing all the software/packages needed can be quite time-consuming, and the set evolves over time as packages are upgraded. Worse, sometimes dependencies break as a result of such upgrades. So it’s always a good idea to test & integrate a virtual environment into your versioning system, e.g., using Data Science Toolbox, Virtualenv, Anaconda, or even a Docker container to create a unified environment for versioning data science.
JSONify (hyper)parameters. No matter how tempting it is to hard-code parameters in your scripts, don’t do it. Instead put them into a global JSON file. If seeds were created, put them in this file as well. Better yet, combine the model score JSON file with this one so your data project has a single reference to track machine learning model performance, how different parameter settings affect performance, and how models evolve over time.
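A minimal sketch of what this looks like in practice (the file name and keys are illustrative):
# params.json might contain:
# {"model": "gbm-v3", "seed": 42, "learning_rate": 0.1, "max_depth": 6}
import json

with open('params.json') as f:
  params = json.load(f)

seed = params['seed']
print('training with', params)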
Create a domain knowledge stream. One thing that often gets overlooked, or at least poorly managed, is the versioning of domain knowledge. A crucial part of data science is domain know-how, and how well it is used affects the quality of the results. Instead of using Excel or Word to keep track of research findings or notes from SMEs, create a dedicated stream in the versioning system.

Data science versioning solutions

Just like VCS for software development, versioning data science comes with two options: hosted vs. on-prem. However, versioning data science is a bit different, since data & model privacy and compliance can complicate things and make a hosted solution a non-starter.
In most cases, a hosted service such as GitHub or Bitbucket plus cloud storage (S3 or Google Cloud Storage) lays the foundation for data science versioning. In case companies don’t want any of their data (even encrypted) sitting on someone else’s disk, they can set up Git and Git Large File Storage, or other distributed file systems such as HDFS, on local servers to achieve the same goal.
No matter which path we go down, a centralized versioning engine is necessary. Here are a few promising candidates:
Data Version Control (DVC). DVC is a new open source project that makes data science reproducible by automatically building data dependency graphs (DAGs). It integrates with Git, AWS S3, and Google Cloud Storage, and allows sharing code and data separately from a single DVC environment.
[Figure: DVC sharing. Image credit: DVC]
Pachyderm. Another option is Pachyderm, a tool for production data pipelines. Built on Docker and Kubernetes, Pachyderm can be installed locally or deployed on AWS, GCP, Azure, and more.
Cookiecutter Data Science. A third option is Cookiecutter Data Science, which offers a logical, reasonably standardized, but flexible project structure for doing and sharing data science work. It creates a directory structure (a template) for a data science project and builds a DAG for your data flow. It can be installed locally as well.
Luigi. Luigi, built by Spotify, is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in. Check out the source code for more.
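A minimal Luigi sketch of a two-task pipeline (the task names, file paths, and placeholder logic are illustrative):
import luigi

class CleanData(luigi.Task):
  def output(self):
    return luigi.LocalTarget('clean.csv')

  def run(self):
    with self.output().open('w') as f:
      f.write('col1,col2\n1,2\n')  # placeholder for real cleaning logic

class TrainModel(luigi.Task):
  def requires(self):
    return CleanData()

  def output(self):
    return luigi.LocalTarget('model.txt')

  def run(self):
    with self.input().open() as f, self.output().open('w') as out:
      out.write('trained on %d rows\n' % len(f.readlines()))

if __name__ == '__main__':
  luigi.build([TrainModel()], local_scheduler=True)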

Conclusion

Data science has its own uniqueness and therefore requires specific versioning strategies & solutions. Hopefully by now you’ve got a high-level picture of how to version data science projects. Time to get your hands dirty and apply these strategies & solutions to your next project. I am excited to learn what you’ll do!
