Is correlation enough?
Anderson said:
But faced with massive data,
this approach to science — hypothesize, model, test — is becoming obsolete. ... There is now a better way. Petabytes allow us to say:
"Correlation is enough."
The above graph shows a statistically significant correlation between chocolate consumption
per capita and the number of Nobel laureates in a country. Would a country then
increase its number of Nobel laureates by increasing its chocolate consumption?
Of course not; correlation does not imply causation. 'For example, in Italian cities the number of churches and the number of
homicides per year are proportional to the population, which of course does not
mean that an increase in the number of churches corresponds to an
increase in the number of homicides, or vice versa!' (Big Data, Complexity and
Scientific Method, http://www.syloslabini.info/online/big-data-complexity-and-scientific-method/).
It is possible to find any number of such
"spurious" correlations. A good site is: http://www.tylervigen.com/.
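The churches-and-homicides example can be sketched in a short simulation (my own illustration, not taken from the cited article): two quantities that have nothing to do with each other correlate strongly simply because both are driven by a common factor, here labeled "population".

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation coefficient, no libraries needed."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(0)
years = range(50)
# A common driver (growing population) behind two otherwise unrelated series:
population = [1000 + 20 * t + random.gauss(0, 30) for t in years]
churches   = [p * 0.01  + random.gauss(0, 1)   for p in population]
homicides  = [p * 0.005 + random.gauss(0, 0.5) for p in population]

# Strongly positive, even though neither quantity causes the other.
print(round(pearson_r(churches, homicides), 2))
```

Conditioning on the confounder (population) would make the apparent relationship largely disappear, which is exactly why such correlations are called spurious.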
However, correlation is not entirely worthless, as Edward
Tufte (quoted in the Wikipedia article "Correlation does not imply
causation") clarifies:
- "Empirically observed covariation is a necessary but not sufficient condition for causality."
- "Correlation is not causation but it sure is a hint."
The EFF emphasized that big data analysis can be accurate
and effective only if the data collection and analysis are done carefully and
purposefully, and that for big data analysis to be valid, one must follow rigorous
statistical practices.
Simply “collecting it all” and then trying to extract useful
information from the data by finding correlations is likely to lead to
incorrect (and, depending on the particular application, harmful or even
dangerous) results.
The reason is that three big data analysis problems need to be addressed before any
trade-offs with privacy can be explored:
Problem 1: Sampling Bias
... “that
‘N = All’, and therefore that sampling bias does not matter, is simply not true
in most cases that count.” On the contrary, big data sets “are so messy, it can be hard to figure
out what biases lurk inside them – and because they are so large, some analysts
seem to have decided the sampling problem isn’t worth worrying about. It is.”
Correcting
for sampling bias is especially important given the digital divide. By assuming
that data generated by people’s interactions with devices, apps, and websites
are representative of the population as a whole, policy-makers risk
unintentionally redlining large parts of the population. Simply put, “with
every big data set, we need to ask which people are excluded. Which places are
less visible? What happens if you live in the shadow of big data sets?”
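The digital-divide point can be made concrete with a toy simulation (all numbers here are made up for illustration): if data is collected only through apps and websites, the offline part of the population is invisible, and no amount of additional online data corrects the resulting bias.

```python
import random

random.seed(1)

# Hypothetical population: 70% online, 30% offline, with different mean incomes.
online  = [random.gauss(60, 10) for _ in range(70_000)]
offline = [random.gauss(35, 10) for _ in range(30_000)]
population = online + offline

true_mean = sum(population) / len(population)

# "N = All" collected via devices and websites actually reaches only the online group:
biased_mean = sum(online) / len(online)

print(f"true population mean: {true_mean:.1f}")
print(f"app-derived estimate: {biased_mean:.1f}")  # systematically too high
```

The biased estimate is off by roughly the between-group gap weighted by the excluded share; collecting ten times more online data shrinks the noise but leaves that gap untouched.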
...
Simply taking a data set and throwing some statistical or machine learning
algorithms at it and assuming “the numbers will speak for themselves” is not
only insufficient—it can lead to fundamentally flawed results.
Problem 2: Correlation is Not Causation (And Sometimes, Correlation is Not Correlation)
Even if
one tackles the sampling problem, a fundamental problem with big data is that
“although big data is very good at detecting correlations…it never tells us
which correlations are meaningful. ...
Even more
problematic, however, is the fact that “big data may mean more information, but
it also means more false information.” This contributes to what is known as the
“multiple-comparisons” problem: if you have a large enough data set, and you do
enough comparisons between different variables in the data set, some
comparisons that are in fact flukes will appear to be statistically
significant.
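The multiple-comparisons problem is easy to reproduce: below, 20 mutually independent noise variables yield 190 pairwise correlations, and with a 5% significance threshold some of them will look "significant" purely by chance. The critical value |r| ≈ 0.361 for n = 30 at the two-sided 5% level is a standard tabulated figure, used here as an assumption rather than computed.

```python
import random
from itertools import combinations

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

random.seed(42)
n_vars, n_obs = 20, 30
# Every variable is pure, independent noise: no real correlations exist.
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

# |r| > ~0.361 corresponds to p < 0.05 (two-sided) for n = 30 observations.
CRITICAL_R = 0.361
flukes = [(i, j) for i, j in combinations(range(n_vars), 2)
          if abs(pearson_r(data[i], data[j])) > CRITICAL_R]

n_pairs = n_vars * (n_vars - 1) // 2
print(f"{len(flukes)} of {n_pairs} pairs look 'significant' by chance alone")
```

With 190 comparisons at the 5% level one expects around 9 or 10 flukes on average; the more variables a data set has, the more such phantom correlations it will produce.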
Problem 3: Fundamental Limitations of Machine Learning
Many
computer scientists would argue that one way to combat false correlations is to
use more advanced algorithms, such as those involved in machine learning. But
even machine learning suffers from some fundamental limitations.
First and
foremost, “getting machine learning to work well can be more of an art than a
science.”
Second,
machine-learning algorithms are just as susceptible to sampling biases as
regular statistical techniques, if not more so. The failure of the Google Flu
Trends experiment is a prime example of this: machine-learning algorithms are
only as good as the data they learn from.
If the underlying data changes, then the machine-learning algorithm
cannot be expected to continue functioning correctly. ...
Additionally,
many machine-learning techniques are fragile: if their input data is perturbed
ever so slightly, the results will change significantly. ... Finally, machine
learning, especially model-free learning, is not a valid replacement for more
careful statistical analysis (or even machine learning using a model).
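Fragility under small input perturbations can be shown with a minimal hand-rolled 1-nearest-neighbor classifier (my choice of example, not the EFF's): a query point that sits near the boundary between two classes flips its predicted label when the input is nudged by a tiny amount.

```python
def nearest_neighbor(train, query):
    """1-NN: return the label of the training point closest to the query."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(train, key=lambda item: dist2(item[0], query))[1]

# Two classes separated around x = 1.0.
train = [((0.0, 0.0), "A"), ((0.9, 0.1), "A"),
         ((1.1, 0.1), "B"), ((2.0, 0.0), "B")]

query = (0.99, 0.1)                                       # near the boundary
print(nearest_neighbor(train, query))                     # → A
print(nearest_neighbor(train, (query[0] + 0.02, query[1])))  # → B, after a 0.02 nudge
```

A perturbation of 0.02 in one coordinate changes the answer entirely, which is the sense in which such techniques are "fragile" near decision boundaries.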
The EFF
concluded that only one particular type of big data analysis can answer
difficult questions and come up with new ways of helping society as a
whole. That is:
...
analysis that attempts to learn a trend or correlation about a population as a
whole (e.g. to identify links between symptoms and a disease, to identify
traffic patterns to enable better urban planning, etc.).
According to them, other uses of big data cannot escape the technical problems mentioned earlier.
Other
uses of big data by their very nature cannot overcome these technical
obstacles. Consider the idea of targeting individuals on a massive scale based
on information about them collected for a secondary purpose. By using “found”
data that was not intended for the specific use it is being put to, sampling
biases are inevitable (i.e. Problem 1).
Or
consider the claim by proponents of big data that by “collecting it all” and
then storing it indefinitely, they can use the data to learn something new at
some distant point in the future. Not only will such a “discovery” likely be
subject to sampling biases, but any correlations that are discovered in the
data (as opposed to being explicitly tested for) are likely to be spurious
(i.e. Problem 2).
At the same time, these sorts of uses (individualized targeting, secondary use of data, indefinite data retention, etc.) pose the greatest privacy threats, since they involve using data for purposes for which consent was not originally given and keeping it longer than otherwise necessary.