“Big Data” has become one of the most heralded tools, or at least buzzwords, of this decade. The use of “big data” is trumpeted by many studies in almost all disciplines, and, to an extent, understandably so. Hand-in-hand with the geometric progression of computing power has been the development of algorithms that can parse huge databases and find patterns within them. What concerns me is the increasing conflation of correlation with causation. What these algorithms, and the science of statistics in general, are able to uncover are merely the relationships between variables, with no insight at all into why those variables are related. Amusing, yet sobering, examples of this phenomenon are collected by Tyler Vigen on his site Spurious Correlations. I suggest glancing at a few.
In combinatorics, there is a result known as the Hales-Jewett theorem. In highly simplified layman’s terms, it states that as the dimensionality of a structure increases, some form of combinatorial regularity is guaranteed to appear: color the cells of a high-dimensional cube with any finite number of colors, and a monochromatic line must exist. For example, in a tic-tac-toe game of sufficiently high dimension, a draw is impossible; some row, column, or diagonal must end up entirely one player’s color, so someone must win.
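For the curious, the standard formal statement (my wording of the textbook version, not anything from this post) runs roughly as follows:

```latex
% Standard statement of the Hales--Jewett theorem (for reference; the
% tic-tac-toe description above is the intuition for small n).
\[
  \forall\, n, c \in \mathbb{N}\ \ \exists\, H = HJ(n, c):
  \quad \text{every } c\text{-coloring of } \{1,\dots,n\}^{H}
  \text{ contains a monochromatic combinatorial line.}
\]
% A combinatorial line is the set of n points obtained by fixing some
% coordinates and letting the remaining "wildcard" coordinates move
% through 1, ..., n in unison.
```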
What does this have to do with big data? Consider these huge data sets as collections of observations. Each observation is a point in some hyperspace whose coordinate axes are the variables of interest and whose position depends on the values of those variables. Data mining algorithms are let loose in this hyperspace to find patterns and relationships between data points, and almost all of them will partition the data into a finite set of possible outcomes. Let these outcomes be analogous to a finite set of colors used to color the points in hyperspace. If this is a reasonable model, the Hales-Jewett theorem states that as the dimensionality increases, it is guaranteed (not convergence in distribution, not convergence in probability, but guaranteed) that some combinatorial structure will exist within this space. In other words, some correlated relationship will be found even though no causal reason for it exists. As dimensionality increases, it simply becomes impossible for the space to be totally uniform.
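The statistical flavor of this is easy to demonstrate for yourself. Here is a minimal sketch (mine, not from any particular study; the observation and variable counts are arbitrary assumptions) that mines pure noise and reliably “finds” relationships:

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_vars = 100, 1000  # few observations, many candidate variables

# Every column is independent noise: there is no causal structure at all.
X = rng.standard_normal((n_obs, n_vars))

# Treat the first column as the "outcome" and mine the rest for correlations.
y = X[:, 0]
corrs = np.array([np.corrcoef(y, X[:, j])[0, 1] for j in range(1, n_vars)])

print(f"strongest spurious correlation: |r| = {np.abs(corrs).max():.3f}")
print(f"variables with |r| > 0.2: {(np.abs(corrs) > 0.2).sum()} of {n_vars - 1}")
```

On a typical run, dozens of noise variables clear |r| > 0.2 and the strongest lands well above 0.3, purely by chance: more dimensions, more guaranteed “structure.”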
Maybe I’m wrong in my conceptualization, but this means that much, if not most, of what “big data” algorithms may be finding is not signal but noise. True, it is signal in the sense that the relationships are real, often mathematically guaranteed to exist. Yet it is noise in that the relationships are irrelevant at best, more likely obfuscatory, and possibly downright misleading. Big data algorithms are here to stay, and they will continue to provide insight and value, but in this new world of petabyte data structures, we must be ever more vigilant not to be blinded by the model outputs, and we must approach each finding, no matter how reasonable, with the requisite skepticism. In the ever more important words of George Box, “all models are wrong, but some are useful.” Let’s take care not to confuse statistical outcomes with real-world utility.
Fantastic article. Thanks!
Thank You – great article
Interesting article – this sounds like a problem that can be handled by the data scientist. First, if there is a relationship between dimensionality and the “false positive” rate (to give it an intuitive term), couldn’t the significance of that relationship be quantified? Second, if we are aware of its existence, the data scientist should be able to control for it in their research – i.e., test identified relationships on hold-out samples for consistency, incorporate this error rate into traditional statistical diagnostics, etc.
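A minimal sketch of the hold-out check this comment describes (the split sizes and variable counts here are my own arbitrary assumptions, again using pure noise):

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 200, 1000  # pure independent noise in every column
X = rng.standard_normal((n_obs, n_vars))
y = X[:, 0]

# Split into a discovery half and a hold-out half.
train, test = slice(0, 100), slice(100, 200)

# "Discover" the predictor most correlated with y on the training half...
corrs = np.array([np.corrcoef(y[train], X[train, j])[0, 1]
                  for j in range(1, n_vars)])
best = 1 + int(np.abs(corrs).argmax())

# ...then re-test that single relationship on the untouched hold-out half.
r_train = corrs[best - 1]
r_test = np.corrcoef(y[test], X[test, best])[0, 1]
print(f"variable {best}: r = {r_train:+.3f} in-sample, {r_test:+.3f} out-of-sample")
```

The mined correlation looks impressive in-sample and collapses toward zero on the hold-out, which is exactly why validation on data the algorithm never saw is the standard defense against this kind of guaranteed-by-dimensionality structure.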