Data science, being a relatively new field, has no go-to version control standard. In this post I discuss the data science cycle, its unique versioning needs, and some possible solutions.
Data Science Cycles
Data science is different from traditional software development, especially in stages such as exploratory data analysis, feature engineering, and machine learning model training & validation. While both data science and software development involve writing code, data science tends to be more iterative and cyclical: a cycle often starts with some initial understanding of the data (and hence questions), moves on to collecting, exploring, cleaning, and transforming the data, and finally to building, validating, and deploying machine learning models, which in turn leads to better data understanding and the start of the next cycle.
Image credit: DataScience.LA
Data science is also more interactive, and we can see this from the tools data scientists use: typically not just an IDE, but notebooks (e.g., Jupyter, Zeppelin, or Beaker), which by nature are more interactive and enable shorter code-to-result cycles.
This calls for a different version control paradigm, because simply copying the software development VCS (version control system) workflow over to data science won't work (anyone who has tried to naively git-commit an .ipynb file knows how painful and inflexible it can be). Data science needs its own versioning approach so data scientists & engineers can better collaborate, test, share, and reuse.
Currently, although we are seeing some progress, data science version control is still relatively immature and has a lot of room for improvement. In the rest of this blog post I'll discuss data science's unique versioning strategy.
Versioning Strategy
Reproducibility
The guiding principle of doing data science is that people need to be able to easily see and replicate each other’s work. Cookiecutter Data Science has a great comment on this:
A well-defined, standard project structure means that a newcomer can begin to understand an analysis without digging in to extensive documentation. It also means that they don’t necessarily have to read 100% of the code before knowing where to look for very specific things.
Reproducibility means that for every result we keep track of how it was produced, and for any analysis that involves randomness, we note the underlying random seeds. All custom scripts need to be version controlled.
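For example, here's a minimal sketch of fixing and recording random seeds alongside a result (the script name and metadata file name are illustrative, and a NumPy-based workflow is assumed):

import json
import random

import numpy as np

# Fix the seeds once, at the top of the script/notebook.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Record the seed (and the script that produced the result) next to the
# result itself, so the run can be replicated later.
run_metadata = {'script': 'train_model.py', 'seed': SEED}
with open('run_metadata.json', 'w') as f:
    json.dump(run_metadata, f, indent=2)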
One crucial thing to realize (yet sometimes overlooked) is that in order to ensure data science reproducibility, it's not just code that we need to version control (for code, versioning tools like GitHub, Bitbucket, and GitLab should suffice). There are two additional dimensions that need version control as well: data and models.
Data versioning
Compared with code, data is usually harder to version control due to one simple fact: size. You should feel lucky if your data is only a couple of GB, since in this big data age it's not uncommon for data to be TBs or PBs in size.
One good practice is to treat data (data objects such as files, not data that sits in a SQL database) as immutable: never naively edit or overwrite your raw data, and keep only one version of it.
Sometimes we might want to dump intermediate data to disk, for example when it results from heavy, time-consuming processing scripts. Be sure to clearly mark the generating script and its version. Also, try not to change the data path once it sits in the system; otherwise all scripts/notebooks pointing to the data will break.
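A minimal sketch of such an intermediate-data dump (the script name, version tag, and paths are all illustrative; pandas with parquet support is assumed):

import pandas as pd

# Encode the generating script and its version in the file name, so the
# provenance of the intermediate data is never ambiguous.
SCRIPT_NAME = 'clean_transactions'   # the heavy-processing script (made up)
SCRIPT_VERSION = 'v3'                # bump whenever the script changes

def cache_intermediate(df):
    path = 'data/interim/{}-{}.parquet'.format(SCRIPT_NAME, SCRIPT_VERSION)
    df.to_parquet(path)              # keep this path stable once it's written
    return path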
There are both open-source and commercial solutions for data version control. For example, cloud vendors such as Amazon and Google provide versioning options in their S3 and Cloud Storage services, so data scientists can easily recover from unintended actions and application failures. Also, Git Large File Storage is an open-source Git extension for versioning large files, and can be set up on a local server. One thing to keep in mind, though, is that changing a raw file and saving an updated version can quickly use up your disk in such version control systems (with a 5GB file, changing even 1 byte will cost you another 5GB).
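As an example of the first option, here is a minimal sketch of enabling S3 object versioning with boto3 (the bucket name is a placeholder, and it assumes your AWS credentials are already configured):

import boto3

s3 = boto3.client('s3')

# Turn on object versioning for the bucket holding the raw data, so
# accidental overwrites and deletions can be rolled back.
s3.put_bucket_versioning(
    Bucket='my-ds-data',  # placeholder bucket name
    VersioningConfiguration={'Status': 'Enabled'},
)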
Models versioning
Models should be treated as immutable as well. Models can have exponentially many versions as a result of feature engineering, parameter tuning, new data coming in, etc., and therefore call for more specific version control rules. Typically, it's a good idea to follow these rules (a small sketch putting them together follows the list):
- Specify a name, version, and training script for the model. Having a uniform naming convention helps, e.g., (datetime)-(model name)-(model version)-(training script id)
- Make sure file names are unique (to avoid accidental overwriting)
- Have a global JSON file that maps model names to scores (e.g., the key is the model name and the value is a 3-element tuple of training/validation/test set scores)
- Have a script to serve & rotate models according to a policy (e.g., test set score), and to clean up legacy models (those in the bottom x% according to the policy, or those that haven't been served for x months/years)
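Here is a minimal sketch combining the naming convention and the global score file (the file names, the .pkl extension, and the registry path model_scores.json are all illustrative assumptions):

import json
from datetime import datetime

def model_file_name(model_name, model_version, training_script_id):
    # Build a unique, convention-following file name for a trained model.
    stamp = datetime.now().strftime('%Y%m%d-%H%M%S')
    return '{}-{}-{}-{}.pkl'.format(stamp, model_name, model_version, training_script_id)

def record_scores(model_file, train, valid, test, registry_path='model_scores.json'):
    # Add (or update) a model's train/validation/test scores in the global JSON file.
    try:
        with open(registry_path) as f:
            registry = json.load(f)
    except FileNotFoundError:
        registry = {}
    registry[model_file] = (train, valid, test)
    with open(registry_path, 'w') as f:
        json.dump(registry, f, indent=2)

# name = model_file_name('churn-xgb', 'v2', 'train-003')
# record_scores(name, 0.91, 0.87, 0.86)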
Notebook versioning
Notebooks are an integral part of the data science job, and we need them to share results (especially plots and findings from exploratory phases).
Image credit: Jupyter
Their JSON format, however, is a pain for version control. In practice, there are several ways to deal with this:
Removing notebook output. This can either be done manually by clearing all output cells, or programmatically with a library such as nbstripout.
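If you'd rather not add a dependency, the same idea can be hand-rolled in a few lines of Python (a rough sketch; the notebook path is a placeholder, and the standard nbformat-4 cell structure is assumed):

import json

def strip_outputs(notebook_path):
    # Remove outputs and execution counts so diffs only show code and markdown.
    with open(notebook_path) as f:
        nb = json.load(f)
    for cell in nb.get('cells', []):
        if cell.get('cell_type') == 'code':
            cell['outputs'] = []
            cell['execution_count'] = None
    with open(notebook_path, 'w') as f:
        json.dump(nb, f, indent=1)

# strip_outputs('johndoe-eda-2017-02-30.ipynb')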
Refactor the good parts. Here the DRY (don't repeat yourself) principle from software engineering also applies to data science: data scientists shouldn't write code to do the same task in multiple notebooks. Say we have some code to read in a .csv file and encode all the non-numeric columns. In a notebook we would see something like this:
import pandas as pd

df = pd.read_csv('path/to/data')

### code block ###
for c in df:
    if df[c].dtype == 'object':
        df[c] = df[c].factorize()[0]
##################
A standard transformation like this will certainly be reused, so instead of copying/pasting the code block into another notebook, we might want to create a utils.py file, refactor the code block into a function, and import that function in notebooks the next time we need it:

# utils.py
def encode_object_col(df):
    for c in df:
        if df[c].dtype == 'object':
            df[c] = df[c].factorize()[0]
    return df
# johndoe-eda-2017-02-30.ipynb
import pandas as pd
from utils import encode_object_col
df = pd.read_csv('path/to/data')
df = encode_object_col(df)
Better still, as we add more processing steps, such as removing columns with a high percentage of missing values, seeding and splitting the data, etc., we can have another pipeline-like function that calls all those refactored functions, parameterizing each step as a binary flag or a threshold and chaining them together:
# utils.py
def make_data_for_ml(*params):
    ...  # read in data and process it step by step

# johndoe-eda-2017-02-30.ipynb
from utils import make_data_for_ml
X_train, X_test, y_train, y_test = make_data_for_ml(*params)
This seems obvious, but sometimes people (including me) are just too lazy to do it (refactoring requires time and effort too…). In the long run it will only do you harm, making data science projects messy and harder to manage. So refactor whenever you can, and bear in mind that the principle is to keep the code in notebook files as high level as possible, while leveraging the VCS to version the comparatively lower-level code in scripts.
The last thing I want to add is that a naming convention is crucial for notebook files, since they are usually exploratory scripts we may revisit multiple times (and thus need to search for). A common naming pattern is date-author-topic.ipynb, so we can search by:
- date (ls 2017-01-*.ipynb)
- author (ls 2017*-johndoe-*.ipynb)
- topic (ls *-*-feature_creation.ipynb)
All fields, of course, need to follow a common standard (e.g., pick one of 2017-01-30 or 01-30-2017, and one of JohnDoe, JDoe, or jd).
Other good versioning practices
Set up branches. Borrowed from software engineering, branching can be used for data science in multi-user and multi-phase situations, where a typical data science phase (ingestion, EDA, feature engineering, etc.) can span multiple branches.
Use a virtual environment. Installing all the software/packages needed can be quite time-consuming, and the setup evolves as packages are upgraded. Worse, sometimes dependencies break as a result of such upgrades. So it's always a good idea to test & integrate a virtual environment into your versioning setup, e.g., using Data Science Toolbox, Virtualenv, Anaconda, or even a Docker container to create a unified environment for versioned data science work.
JSONify (hyper)parameters. No matter how tempting it is to hard-code parameters in your scripts, don't do it. Instead, put them into a global JSON file. If seeds were created, put them in this file as well. Better still, combine the model score JSON file with this one so your project has a single place to track machine learning model performance, how different parameter settings affect it, and how models evolve over time. A small sketch follows.
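A minimal sketch of what this could look like (the file name params.json and the keys are illustrative):

import json

# params.json (illustrative content):
# {
#     "seed": 42,
#     "n_estimators": 500,
#     "max_depth": 6,
#     "test_size": 0.2
# }

with open('params.json') as f:
    params = json.load(f)

# Scripts and notebooks read from this single file instead of hard-coding values.
seed = params['seed']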
Create a domain knowledge stream. One thing that often gets overlooked, or at least poorly managed, is the versioning of domain knowledge. A crucial part of data science is domain know-how, and how well it is used affects the quality of the results. Instead of using Excel or Word to keep track of research findings or notes from SMEs (subject matter experts), create a dedicated stream in the versioning system.
Data science versioning solutions
Just like VCS for software development, we face two options for versioning data science work: hosted vs. on-prem. However, data science is a bit different, since data & model privacy and compliance can complicate things and sometimes rule out hosted solutions entirely.
In most cases, a hosted service such as GitHub or Bitbucket plus cloud storage (S3 or Google Cloud Storage) lays the foundation for data science versioning. If a company doesn't want any of its data (even encrypted) sitting on someone else's disks, it can set up Git and Git Large File Storage, or a distributed file system such as HDFS, on local servers to achieve the same goal.
No matter which path we go down, a centralized versioning engine is necessary. Here are a few promising candidates:
Data Version Control (DVC). DVC is a new open-source project that makes data science reproducible by automatically building data dependency graphs (DAGs). It integrates with Git, AWS S3, and Google Cloud Storage, and allows sharing code and data separately from a single DVC environment.
DVC sharing. Image credit: DVC
Pachyderm. Another option is Pachyderm, a tool for production data pipelines. Building on Docker and Kubernetes, Pachyderm can be installed locally and deployed on AWS, GCP, Azure, and more.
Cookiecutter Data Science. Yet a third option is Cookiecutter Data Science, which offers a logical, reasonably standardized, but flexible project structure for doing and sharing data science work. It creates a directory structure (a template) for a data science project and builds a DAG for your data flow. It can be installed locally as well.
Luigi. Luigi, built by Spotify, is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in. Check out the source code for more.
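To make the last option concrete, here is a minimal, hypothetical Luigi task (the task, file paths, and parameters are made up for illustration; it assumes luigi and pandas are installed):

import datetime

import luigi
import pandas as pd

class CleanData(luigi.Task):
    # A toy task: read raw data, drop fully-empty rows, write an intermediate file.
    date = luigi.DateParameter()

    def output(self):
        # Luigi uses the existence of this target to decide whether the task already ran.
        return luigi.LocalTarget('data/interim/clean-{}.csv'.format(self.date))

    def run(self):
        df = pd.read_csv('data/raw/transactions.csv')  # placeholder raw file
        df = df.dropna(how='all')
        with self.output().open('w') as f:
            df.to_csv(f, index=False)

# Run locally:
# luigi.build([CleanData(date=datetime.date(2017, 2, 28))], local_scheduler=True)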
Conclusion
Data science has its own uniqueness and therefore requires its own versioning strategies & solutions. Hopefully by now you have a high-level picture of how to version data science projects. Time to get your hands dirty and apply these strategies & solutions to your next project. I'm excited to learn what you'll do!
Source: https://shuaiw.github.io