Friday, December 19, 2014

Big data: problems of correlation, bias, and machine learning


Is correlation enough?


Anderson said:
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. ... There is now a better way. Petabytes allow us to say: "Correlation is enough."

A widely circulated graph shows a statistically significant correlation between chocolate consumption per capita and the number of Nobel laureates in a country. Does that mean a country could increase its number of Nobel laureates by increasing its chocolate consumption? No; correlation does not imply causation. 'For example, in Italian cities the number of churches and the number of homicides per year are proportional to the population, which of course does not mean that an increase in the number of churches corresponds to an increase in the number of homicides, or vice versa!' (Big Data, Complexity and Scientific Method, http://www.syloslabini.info/online/big-data-complexity-and-scientific-method/).

It is possible to find any number of such "spurious" correlations. A good site is: http://www.tylervigen.com/.
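Spurious correlations of this kind often arise when two unrelated quantities share nothing but a common trend. The sketch below (all numbers invented) builds two series that are linked only by time, yet their Pearson correlation comes out near 1:

```python
# Sketch: two series that share only a time trend show a high Pearson
# correlation despite having no causal link (hypothetical data).

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

# Two unrelated quantities that both happen to grow over time,
# like chocolate consumption and Nobel laureates in the anecdote above.
years = range(20)
series_a = [2.0 + 0.5 * t for t in years]                # steady growth
series_b = [10.0 + 3.0 * t + (-1) ** t for t in years]   # growth plus a small wiggle

r = pearson_r(series_a, series_b)
print(r)  # very close to 1.0, yet neither series causes the other
```

The high correlation here is entirely an artifact of the shared trend, which is exactly why "correlation is enough" fails as a methodology.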

However, correlation is not entirely worthless, as Edward Tufte (quoted in Wikipedia's "Correlation does not imply causation" article) clarifies:
        "Empirically observed covariation is a necessary but not sufficient condition for causality."
        "Correlation is not causation but it sure is a hint."

Prompted by the White House’s Big Data Report and the PCAST Report, the US National Telecommunications and Information Administration requested public comment on big data and consumer privacy in the Internet economy. The Electronic Frontier Foundation's August 2014 comment "focused on one main point: that policymakers should be careful and skeptical about claims made for the value of big data, because over-hyping its benefits will likely harm individuals’ privacy." (http://www.ntia.doc.gov/files/ntia/eff.pdf)

The EFF emphasized that big data analysis can be accurate and effective only if the data collection and analysis are done carefully and purposefully, and that valid big data analysis requires rigorous statistical practice.

Simply “collecting it all” and then trying to extract useful information from the data by finding correlations is likely to lead to incorrect (and, depending on the particular application, harmful or even dangerous) results.

The reason is that three problems with big data analysis need to be addressed before any trade-offs with privacy can be explored:

Problem 1: Sampling Bias

... “that ‘N = All’, and therefore that sampling bias does not matter, is simply not true in most cases that count.” On the contrary, big data sets “are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.”

Correcting for sampling bias is especially important given the digital divide. By assuming that data generated by people’s interactions with devices, apps, and websites are representative of the population as a whole, policy-makers risk unintentionally redlining large parts of the population. Simply put, “with every big data set, we need to ask which people are excluded. Which places are less visible? What happens if you live in the shadow of big data sets?”

... Simply taking a data set and throwing some statistical or machine learning algorithms at it and assuming “the numbers will speak for themselves” is not only insufficient—it can lead to fundamentally flawed results.
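The "which people are excluded" point can be made concrete with a toy sketch (all numbers invented): if a data set covers only the people who generate digital traces, statistics computed from it can be badly off for the whole population, no matter how large the data set is.

```python
# Sketch: a convenience sample drawn only from one subgroup
# misestimates the population mean (hypothetical numbers).

# Hypothetical population: 70% "online" users and 30% "offline" residents,
# who differ on some quantity of interest.
population = [50] * 700 + [20] * 300   # online users measure 50, offline 20

true_mean = sum(population) / len(population)

# "N = All" in practice: the data set only ever sees the online subgroup.
observed = population[:700]
biased_mean = sum(observed) / len(observed)

print(true_mean)    # 41.0
print(biased_mean)  # 50.0 -- the offline 30% are invisible to the analysis
```

No amount of additional online data fixes this: collecting ten times as many online records still yields 50.0, because the bias is in who is sampled, not in how many.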

Problem 2: Correlation is Not Causation (And Sometimes, Correlation is Not Correlation)

Even if one tackles the sampling problem, a fundamental problem with big data is that “although big data is very good at detecting correlations…it never tells us which correlations are meaningful. ...

Even more problematic, however, is the fact that “big data may mean more information, but it also means more false information.” This contributes to what is known as the “multiple-comparisons” problem: if you have a large enough data set, and you do enough comparisons between different variables in the data set, some comparisons that are in fact flukes will appear to be statistically significant.
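The multiple-comparisons problem is easy to demonstrate: generate a data set of pure noise, test every pair of variables, and a predictable fraction of pairs will look "significant" by chance alone. The sketch below (a hypothetical setup, not from the EFF comment) uses the fact that for n = 30 observations, |r| > 0.361 corresponds roughly to p < 0.05 two-sided:

```python
import random

# Sketch: among many pure-noise variables, some pairwise correlations
# appear "statistically significant" purely by chance.

random.seed(0)  # deterministic for reproducibility

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

n_vars, n_obs = 40, 30
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

# For n = 30, |r| > 0.361 corresponds roughly to p < 0.05 (two-sided).
flukes = sum(
    1
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
    if abs(pearson_r(data[i], data[j])) > 0.361
)
pairs = n_vars * (n_vars - 1) // 2   # 780 comparisons
print(pairs, flukes)  # expect roughly 5% of 780 pairs to be spurious "hits"
```

Every one of those hits is a fluke by construction, since all forty variables are independent noise; a big data analysis that scans for correlations without correcting for multiple comparisons will report them as discoveries.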

Problem 3: Fundamental Limitations of Machine Learning

Many computer scientists would argue that one way to combat false correlations is to use more advanced algorithms, such as those involved in machine learning. But even machine learning suffers from some fundamental limitations.

First and foremost, “getting machine learning to work well can be more of an art than a science.”

Second, machine-learning algorithms are just as susceptible to sampling biases as regular statistical techniques, if not more so. The failure of the Google Flu Trends experiment is a prime example of this: machine-learning algorithms are only as good as the data they learn from. If the underlying data changes, then the machine-learning algorithm cannot be expected to continue functioning correctly. ...

Additionally, many machine-learning techniques are fragile: if their input data is perturbed ever so slightly, the results will change significantly. ... Finally, machine learning, especially model-free learning, is not a valid replacement for more careful statistical analysis (or even machine learning using a model).
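The fragility point can be illustrated with a minimal nearest-neighbour classifier (a stand-in example with invented data and labels, not drawn from the EFF comment): near a decision boundary, a tiny perturbation of the input flips the output entirely.

```python
# Sketch: a 1-nearest-neighbour classifier flips its answer when the
# input is perturbed only slightly (hypothetical data and labels).

def nearest_label(query, points):
    """Return the label of the training point closest to `query`."""
    return min(points, key=lambda p: (p[0] - query) ** 2)[1]

# Two training points with different labels; the decision boundary
# sits exactly halfway between them, at 0.5.
training = [(0.0, "benign"), (1.0, "flagged")]

print(nearest_label(0.49, training))  # "benign"
print(nearest_label(0.51, training))  # "flagged" -- a 0.02 nudge changes the outcome
```

When such a classifier is applied to individuals, this kind of discontinuity means that two nearly identical people can receive opposite treatment, which is precisely the sort of result that careful statistical analysis is meant to guard against.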

The EFF concluded that only one particular type of big data analysis can answer difficult questions and find new ways of helping society as a whole. That is:

... analysis that attempts to learn a trend or correlation about a population as a whole (e.g. to identify links between symptoms and a disease, to identify traffic patterns to enable better urban planning, etc.).

According to the EFF, other uses of big data cannot escape the technical problems described above.

Other uses of big data by their very nature cannot overcome these technical obstacles. Consider the idea of targeting individuals on a massive scale based on information about them collected for a secondary purpose. By using “found” data that was not intended for the specific use it is being put to, sampling biases are inevitable (i.e. Problem 1).

Or consider the claim by proponents of big data that by “collecting it all” and then storing it indefinitely, they can use the data to learn something new at some distant point in the future. Not only will such a “discovery” likely be subject to sampling biases, but any correlations that are discovered in the data (as opposed to being explicitly tested for) are likely to be spurious (i.e. Problem 2).

At the same time, these sorts of uses (individualized targeting, secondary use of data, indefinite data retention, etc.) pose the greatest privacy threats, since they involve using data for purposes for which consent was not originally given and keeping it longer than otherwise necessary.
