Monday, August 26, 2019

Four Ways Data Science Goes Wrong and How Test-Driven Data Analysis Can Help

If, as Niels Bohr maintained, an expert is a person who has made all the mistakes that can be made in a narrow field, we consider ourselves expert data scientists.  After twenty years of doing what’s been variously called statistics, data-mining, analytics and data-science, we have probably made every mistake in the book—bad assumptions about how data reflects reality; imposing our own biases; unjustified statistical inferences and misguided data transformations; poorly generalized deployment; and unforeseen stakeholder consequences.  But at least we’re not alone.
We believe that studying all the ways we get it wrong suggests a powerful “test-driven” approach that can help us avoid some of the more egregious mistakes in the future.  By extending the principles of test-driven development, we can prevent some errors altogether and catch others much earlier, all without sacrificing the rapid, iterative, “train of thought” analysis cycle that is fundamental to successful data science.
Let’s step back.  The successful data scientist applies the traditional scientific method to draw useful conclusions about some phenomenon based on some (perhaps big!) data that reflects it.  Although non-practitioners often view data analysis as a monotonous, mind-numbing process where the analyst feeds in the input data, turns a crank, and produces output, in reality there are many choices to be made along the way, and many pitfalls to catch the unwary.   The “art” of data science is about choosing “interesting questions” to ask of the data: the hypotheses demanded by the scientific method.  These hypotheses are tested, revised and refined, and ultimately lead to conclusions or analytical results: typically charts, tables, predictive models and the like.
Once the analysis is complete, we’re typically left with some kind of software artifact: an “analytical process” that involves a set of steps that transform the input data into well-defined outputs.  Often some or all of that process is later automated and generalized so that updated results can be generated as new data are collected or updated.  But the manner in which an analytical process is created is quite different from how a traditional software program is built.  With a software program, at least in principle, we can specify the desired outcome before we begin; in data analysis, it is precisely that specification (of the analytical results) that is the objective.  We are effectively defining our specification and the software that delivers it simultaneously.  Not only that, the ultimate value of the analysis is critically dependent on how accurately our understanding of the input data and output results relates to the original phenomenon of interest.
Analytical processes can go wrong in all the same ways any piece of software can go wrong, such as crashing or producing obviously incorrect output. Data analysis also offers a plethora of new ways to fail. Insidious errors creep in when our “specification” itself is wrong.  Our process can run correctly in the sense of producing the right kind of output, and not being obviously wrong, but cause us to draw completely invalid conclusions.   These specification errors are often not discovered until much later, if at all. Similarly our process may fail in unexpected ways when presented with new or updated data.
As outlined below, we identify four broad categories of analytical process failure, although in practice such a classification will never be perfectly precise: anyone familiar with software development will know that, in many cases, bugs can be (and are!) converted into features simply by documenting the “erroneous” behaviour as part of the spec.
1. Errors of Implementation. The most basic kind of error is where we just get the program wrong, either in obvious ways, like multiplying instead of dividing, or in subtler ways, like failing to control an accumulation of numerical errors (e.g. a Patriot Missile failure during the first Gulf War that resulted in more than 100 casualties). The twist with data analysis is that it might be quite hard to detect that the results are wrong, especially if they are voluminous.
2. Errors of Interpretation.  Our analysis always depends on the data we consume and produce being correct in two senses: the values must be accurate and they must mean what we think they mean. Even when the first is true, our misunderstandings and misinterpretations often obscure our picture of reality, leading us unknowingly to draw fallacious conclusions.  For example, despite much initial hype, Google Flu Trends doesn’t accurately forecast disease outcomes based on search behavior, since most people don’t have a good understanding of flu symptoms. Even the questions we ask can be the wrong questions, as Tukey observed:
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question” – J. Tukey, The Future of Data Analysis
3. Errors of Process.  Applying statistical methods or inferences correctly often requires that specific assumptions be satisfied. Data transformations often have unpredictable consequences in the face of unexpected data (missing or duplicate values being a common problem) and can lead to unjustifiable results.  There are several great collections of how statistics are done wrong, and the spectacular failure of the Mars Climate Orbiter is a canonical illustration of what happens when different units are mixed without appropriate conversion.
4. Errors of Applicability. An ad hoc approach is common during initial data exploration. But this can result in an analytical process that is overly specific to the initial dataset, making it difficult to repeat or apply to updated data with slight differences.  Although this sometimes results in easily detectable “crashes” (such as when an unexpected value appears or is missing), it can also lead to otherwise inappropriate conclusions in production. The best-known example of this is overfitting a training dataset, leading to models that don’t perform well in production (e.g. Walmart’s recommendation engine failure), but even analyses not involving predictive modelling often “wire in” assumptions and values, making the analytical process of limited applicability.
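As a minimal sketch of how the “unexpected data” problems in categories 3 and 4 might be caught early, the following Python checks a batch of records for missing keys, duplicate keys, and category values that were never seen in the original analysis. The function name `find_data_problems`, the field names, and the sample records are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def find_data_problems(records, key, field, allowed):
    """Return a list of human-readable problems found in `records`.

    Hypothetical helper for illustration: `records` is a list of dicts,
    `key` names the field expected to be present and unique, and
    `allowed` is the set of values seen for `field` in the original data.
    """
    problems = []

    # Missing keys: a record with no key (or a None key) cannot be joined.
    missing = sum(1 for r in records if r.get(key) is None)
    if missing:
        problems.append(f"{missing} record(s) missing '{key}'")

    # Duplicate keys: these silently inflate counts and totals after a join.
    counts = Counter(r[key] for r in records if r.get(key) is not None)
    dupes = sorted(k for k, n in counts.items() if n > 1)
    if dupes:
        problems.append(f"duplicate '{key}' values: {dupes}")

    # Unexpected values: anything outside what the analysis was built on.
    unexpected = sorted({r.get(field) for r in records} - set(allowed), key=str)
    if unexpected:
        problems.append(f"unexpected '{field}' values: {unexpected}")

    return problems


# Made-up sample data containing one of each problem.
records = [
    {"id": 1, "region": "north"},
    {"id": 2, "region": "south"},
    {"id": 2, "region": "south"},     # duplicate key
    {"id": None, "region": "north"},  # missing key
    {"id": 3, "region": "west"},      # value not seen before
]

for problem in find_data_problems(records, "id", "region", {"north", "south"}):
    print(problem)
```

Returning a list of problems, rather than raising on the first one, is a design choice made here so that a single run reports everything wrong with a new batch of data.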
So what can be done?  Several years ago, as we began to realize the benefits of Test Driven Development in our traditional software development, we asked ourselves whether a similar methodology could inform and improve our approach to data analysis.  We believe that the principles of test-driven development provide a promising approach to catching and preventing many of these kinds of errors much earlier.  This might well require improvements to the tools we use in order to preserve the speed and flexibility of ad hoc analysis that we’ve come to expect:
  • Traditional test-driven development approaches can be adopted directly to specify (at least post hoc), verify, refactor and automate the steps in our analytical process.  Tests can prove that input data matches our expectations, and that our analysis can be replicated independently of hardware, parallelism, and external state such as passing time and random seeds. The obstacles to wider adoption of this are the difficulty of following the “test-first” ethos of much test-driven development, together with the lack of good tool support for testing much beyond scalar base types. We have a number of ideas about how tool support can be greatly enhanced, and think a more analysis-centric methodology would also help.
  • It seems likely (though not certain) that a richer type system could allow us to capture the otherwise implicit assumptions we make as we perform data transformations.  Such operations commonly treat our data as undifferentiated lists or matrices of basic data types, losing significant context.  For example, consider a table of customers, and another containing their transactions, linked by a customer key.  A traditional database-like approach is fundamentally unable to recognize that although the average transaction value for a customer with no transactions is undefined, their total transaction value should be zero. Richer metadata, including formatting and units, would allow tools to apply dimensional analysis ideas to prevent silly mistakes and present output in forms less prone to misinterpretation.
  • Just as programmers developed lint and PyFlakes for checking for clear errors and danger signs in C and Python code respectively, we can begin to see the outline of ideas that would allow an analytical equivalent. Wouldn’t that be something?
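The customer and transaction example in the second point can be sketched in plain Python. This is only an illustration under stated assumptions: the function `summarize_transactions`, the customer keys, and the amounts are hypothetical, and `None` stands in for “undefined”. The assertions at the end play the role of tests that record the intended meaning of each output.

```python
def summarize_transactions(customer_ids, transactions):
    """Return {customer_id: (total, average)} for every customer.

    Hypothetical helper: `transactions` is a list of (customer_id, amount)
    pairs. A customer with no transactions gets a total of 0.0 but an
    average of None, because the mean of no values is undefined.
    """
    amounts = {cid: [] for cid in customer_ids}
    for cid, amount in transactions:
        # Fail loudly on a transaction whose key matches no known customer.
        if cid not in amounts:
            raise KeyError(f"transaction for unknown customer {cid!r}")
        amounts[cid].append(amount)

    summary = {}
    for cid, values in amounts.items():
        total = float(sum(values))  # the sum of an empty list is 0
        average = total / len(values) if values else None
        summary[cid] = (total, average)
    return summary


# Made-up data: customer "c3" has no transactions.
customers = ["c1", "c2", "c3"]
transactions = [("c1", 10.0), ("c1", 30.0), ("c2", 5.0)]

summary = summarize_transactions(customers, transactions)

# Tests that pin down the intended meaning of each output.
assert summary["c1"] == (40.0, 20.0)
assert summary["c2"] == (5.0, 5.0)
assert summary["c3"] == (0.0, None)  # total is zero, average is undefined
```

For comparison, a SQL LEFT JOIN followed by SUM and AVG returns NULL for both values for a customer like "c3", which is the kind of lost distinction the point above describes.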
We are still just beginning to explore these ideas, but they are already delivering tangible value in production environments.  If you’d like to learn more, or share your own experiences, please join the conversation at www.tdda.info and @tdda.
