Almost two decades after Fukuyama's 'The End of History?' was published in 1989, there came 'The End of Theory: The Data Deluge Makes the Scientific Method Obsolete' by Chris Anderson, editor-in-chief of Wired magazine, posted on Wired's website in June 2008. Here, I'm simply using the End of History as a reference point to mark the time at which the declaration of the end of theory was made, and also because the two titles sound alike and both made sensational news.
Advantage of late entrants?
If that is truly so, it would be the single great promise to us. In conjunction with it, if the End of Theory is correct, it means we won't need to accumulate and distill the best of past knowledge as building blocks for the knowledge of the present and the future. As Anderson put it:
This is a world where
massive amounts of data and applied mathematics replace every other tool that
might be brought to bear. Out with every theory of human behavior, from
linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows
why people do what they do? The point is they do it, and we can track and
measure it with unprecedented fidelity. With enough data, the numbers speak for
themselves. ... But faced with massive data,
this approach to science — hypothesize, model, test — is becoming obsolete.
... There is now a better way. Petabytes allow
us to say: "Correlation is enough." We can stop looking for models.
We can analyze the data without hypotheses about what it might show. We can
throw the numbers into the biggest computing clusters the world has ever seen
and let statistical algorithms find patterns where science cannot.
Then there is the story of the two young men who went to learn to play the saung (Myanmar harp) from a master. One didn't know anything about harps. The other proudly declared that he had learnt to play a bit on his own, upon which the master said, "You will pay twice the fee of the other one, because with you I will have to make you unlearn what you have learnt wrongly on your own."
So, putting all these together, the solution to our problem of catching up with others seems obvious and simple. Big data would penalize those who are ahead of us in accumulating knowledge, theories and such. We are actually lucky that our people are lagging behind in learning to do the sciences, or most anything else. Now we don't need to waste time making our people unlearn any sciences they have learnt, as most others would need to. Just make our people learn big data technologies, data science and information technology, install enough large computers and enough sensing devices, and it will be done.
Unfortunately, this rosy scenario was never to be. Now, six years after Anderson's dismissal of theory, people have come to realize (some, or most, of them immediately after his assertion) that they would still need to learn their models, hypotheses, and theories to do their traditional sciences, and also learn to do data science and big data.
'Five years ago ... (Google was) able to track the spread of
influenza across the US. ... could do it more quickly than the Centers for
Disease Control and Prevention (CDC) ... tracking had only a day’s delay,
compared with the week or more it took for the CDC ... based on reports from
doctors’ surgeries ... was faster because it was tracking the outbreak by
finding a correlation between what people searched for online and whether they
had flu symptoms ... (it was) quick, accurate and cheap, it was theory-free. ...
The Google team just took their top 50 million search terms and let the
algorithms do the work ... excited journalists asked, can science learn from
Google?'
Such successes gave rise to
"four articles of faith":
□ 'that data analysis produces uncannily accurate results'
□ 'that every single data point can be captured, making old statistical sampling techniques obsolete'
□ 'that ... statistical correlation tells us what we need to know'
□ 'that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.'
But four years after Google Flu Trends' success story, the sad news was that "Google's estimates of the spread of flu-like illnesses were overstated by almost a factor of two". The dominant idea of looking for patterns, which gives primacy to correlation over causation, was the culprit for this failure.
'Google’s engineers weren’t trying to figure out
what caused what. They were merely finding statistical patterns in the data.
They cared about correlation rather than causation. This is common in big data
analysis'.
And Google's algorithms have no way of knowing whether people had changed their internet search behavior, or whether they were catching the "spurious associations" long recognized by statisticians. In this joke mentioned by BMR on Apr 3, 2014 in the comments to Harford's article, you just need to change "misuses of econometrics" to "misuses of big data" to make it current: '... do you remember the decades-old joke about the misuses of econometrics: "If you torture the data hard enough they will confess!"'
But a theory-free analysis of mere correlations is inevitably
fragile. If you have no idea what is behind a correlation, you have no idea
what might cause that correlation to break down. One explanation of the Flu
Trends failure is that the news was full of scary stories about flu in December
2012 and that these stories provoked internet searches by people who were
healthy.
Another possible explanation is that Google’s own search algorithm
moved the goalposts when it began automatically suggesting diagnoses when
people entered medical symptoms.
Harford also pointed out the historical lesson of a poorly administered large sample in forecasting the Roosevelt-Landon presidential election of 1936. The Literary Digest conducted
a postal opinion poll aiming to reach 10 million people, a quarter of the
electorate. After tabulating 2.4 million returns they predicted that Landon
would win by a convincing 55 per cent to 41 per cent. But the actual result was
that Roosevelt crushed Landon by 61 per cent to 37 per cent. In contrast, a small
survey of 3000 interviews conducted by the opinion poll pioneer George Gallup
came much closer to the final vote, forecasting a comfortable victory for
Roosevelt. Lesson: "When
it comes to data, size isn’t everything".
In answering a typical dummy's question "But if 3,000 interviews were good, why weren't 2.4 million far better?" statisticians would answer "Just mind your bias in bigger samples". In the Literary Digest's case:
'It mailed out forms to people on a list it had
compiled from automobile registrations and telephone directories – a sample
that, at least in 1936, was disproportionately prosperous. To compound the
problem, Landon supporters turned out to be more likely to mail back their
answers.'
The result was a sample biased in favor of Landon, rather than one reflecting the whole population without bias. Statisticians had been aware of this for a long time and consciously tried to avoid biased samples. When you take a sample and do your research you are faced with two main sources of error: sampling error and bias. Statisticians say you can handle the first by "doing more of something", that is, taking a larger sample; and the second by "doing something more", that is, looking for sources of bias and trying to eliminate them, which generally means making your sample more representative of the whole population. One big problem with big data is the assumption that N = All, or that big data = all data, which was clearly implied when Anderson said "With enough data, the numbers speak for themselves."
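As a rough sketch of the "doing more of something" versus "doing something more" point, here is a small Python simulation with purely hypothetical response rates (not the actual 1936 figures): a huge sample drawn from a biased frame converges confidently on the wrong answer, while a small random sample lands near the truth.

    import random

    random.seed(42)

    TRUE_ROOSEVELT_SHARE = 0.61   # hypothetical "true" population share

    def unbiased_poll(n):
        """Small random sample: every voter equally likely to be interviewed."""
        hits = sum(random.random() < TRUE_ROOSEVELT_SHARE for _ in range(n))
        return hits / n

    def biased_poll(n, response_penalty=0.4):
        """Huge postal poll drawn from a frame in which Roosevelt supporters
        are less likely to respond (a stand-in for the car and telephone lists)."""
        responses = []
        while len(responses) < n:
            is_roosevelt = random.random() < TRUE_ROOSEVELT_SHARE
            respond_prob = 1.0 - response_penalty if is_roosevelt else 1.0
            if random.random() < respond_prob:
                responses.append(is_roosevelt)
        return sum(responses) / n

    print(f"3,000 unbiased interviews: Roosevelt at {unbiased_poll(3_000):.1%}")
    print(f"2,400,000 biased returns : Roosevelt at {biased_poll(2_400_000):.1%}")
    # Enlarging the sample shrinks sampling error, but the biased poll only
    # becomes more precisely wrong; removing the bias is what fixes it.

With these invented numbers, the larger poll keeps reporting Landon ahead no matter how many more returns are tabulated, which is the sense in which size isn't everything.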
'An example is Twitter. It is in principle
possible to record and analyse every message on Twitter and use it to draw
conclusions about the public mood. (In practice, most researchers use a subset
of that vast “fire hose” of data.) But while we can look at all the tweets,
Twitter users are not representative of the population as a whole. (According
to the Pew Research Internet Project, in 2013, US-based Twitter users were
disproportionately young, urban or suburban, and black.)
Consider Boston’s Street Bump smartphone app, which uses a phone’s
accelerometer to detect potholes without the need for city workers to patrol
the streets. As citizens of Boston download the app and drive around, their
phones automatically notify City Hall of the need to repair the road surface.
Solving the technical challenges involved has produced, rather beautifully, an
informative data exhaust that addresses a problem in a way that would have been
inconceivable a few years ago. The City of Boston proudly proclaims that the
“data provides the City with real-time information it uses to fix problems and
plan long term investments.”
Yet what Street Bump really produces, left to its own
devices, is a map of potholes that systematically favours young, affluent areas
where more people own smartphones. ...
That is not the same thing as recording every pothole.'
There was the story of US discount department store Target
reported in The New York Times in 2012. According to the report one man stormed
into a Target and complained to the manager that the company was sending
coupons for baby clothes and maternity wear to his teenage daughter. The
manager apologised. But later the father found that the teenager was indeed
pregnant. Target had found it out by analysing her purchases of unscented wipes and magnesium supplements. But this was really a case of the false-positive problem: the countless women who received coupons for babywear but were not pregnant never made the news and were never taken into account. Indeed, it could be that pregnant women receive such offers merely because everybody on Target's mailing list receives such offers!
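As a back-of-envelope illustration of this false-positive point, with numbers invented purely for the arithmetic (nothing here comes from Target), even a targeting model that catches most pregnancies will shower far more non-pregnant customers with baby coupons, and only the occasional hit becomes a story:

    # Hypothetical numbers, for illustration only.
    customers      = 1_000_000   # women on the mailing list
    pregnancy_rate = 0.04        # fraction actually pregnant
    sensitivity    = 0.80        # share of pregnant customers the model flags
    false_positive = 0.10        # share of non-pregnant customers it flags anyway

    true_hits  = customers * pregnancy_rate * sensitivity            # 32,000
    false_hits = customers * (1 - pregnancy_rate) * false_positive   # 96,000

    precision = true_hits / (true_hits + false_hits)                 # 0.25
    print(f"misdirected babywear coupons: {false_hits:,.0f}")
    print(f"precision of the targeting:   {precision:.0%}")
    # The one startled father is a true positive; the tens of thousands of
    # misdirected mailings never make the newspapers.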
Yet there is another problem that threatens big data precisely because it is big: the multiple-comparisons problem, which arises when a researcher looks at many possible patterns in the data. Test enough different correlations and fluke results will drown out the real discoveries.
'There are various ways to deal with this but the problem is
more serious in large data sets, because there are vastly more possible
comparisons than there are data points to compare. Without careful analysis,
the ratio of genuine patterns to spurious patterns – of signal to noise –
quickly tends to zero.
Worse still, one of the antidotes to the multiple-comparisons
problem is transparency, allowing other researchers to figure out how many
hypotheses were tested and how many contrary results are languishing in desk
drawers because they just didn’t seem interesting enough to publish. Yet found
data sets are rarely transparent. Amazon and Google, Facebook and Twitter,
Target and Tesco – these companies aren’t about to share their data with you or
anyone else.'
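A minimal sketch of the multiple-comparisons point, using nothing but random noise: test enough candidate predictors against a target and a fair number of them will look "significantly" correlated by pure chance.

    import random
    import statistics

    random.seed(1)

    N_POINTS   = 50      # observations per variable
    N_FEATURES = 2_000   # candidate predictors, all pure noise

    def corr(xs, ys):
        """Pearson correlation of two equal-length lists."""
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

    target = [random.gauss(0, 1) for _ in range(N_POINTS)]
    noise  = [[random.gauss(0, 1) for _ in range(N_POINTS)] for _ in range(N_FEATURES)]

    # |r| > 0.28 is roughly the 5% significance cut-off for n = 50
    flukes = sum(abs(corr(col, target)) > 0.28 for col in noise)
    print(flukes, "of", N_FEATURES,
          "pure-noise predictors look 'significantly' correlated with the target")

Roughly five per cent of the noise variables, on the order of a hundred here, clear that bar, which is exactly the signal-to-noise worry in the quote above.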
Among the comments to Harford's article is this amusing one by Angoisse of April 8, 2014. It tries to discredit the scientific method as being based on an outdated model from the Age of Enlightenment:
'A scientist
once taught a spider to jump to the sound of a bell. He plucked a leg and the
spider jumped. He removed a second leg and the spider jumped. He removed a
third leg, same result. And so on until the spider had no more legs. He rang
the bell but the spider did not jump. The scientist thus claimed he had proven
that spiders hear through their legs.
Stupid story, maybe... but it points out that hypothesis might be incorrect
even though the relationship of events remain. Our way of conceiving and
proving theory is based on an outdated model from the Age of Enlightenment. Big
Data is coming to challenge our mindsets; lets not vilify it by understanding
our natural fear of change, particularly in academia, but also let's not adore
it as the new Golden Calf. It is just a new tool at our disposal.'
It's a lousy experiment, because by the end you can't separate the spider's inability to hear the bell from its inability to jump at all. It's a nice joke, but you can't discredit the scientific method that easily. Actually, the lesson of the story cuts the other way: jumping and bell-ringing are clearly correlated, so a data scientist à la Anderson would invariably conclude that spiders hear through their legs, without much ado!
I started out with Anderson, who proclaimed the end of theory and the death of the scientific method. I brought in Harford, who was reacting against the methods of big data in the context of found data: 'the "big data" that interests many companies is what we might call "found data", the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast. Google Flu Trends was built on found data and it's this sort of data that interests me here.'
But there were those who criticized Anderson's idea on the methodology of science, such as Massimo Pigliucci, a scientist with doctorates in Genetics, Botany, and Philosophy of Science. In his article "The end of theory in science?" in EMBO Reports, the journal of the European Molecular Biology Organization, June 2009 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2711825/), Pigliucci said:
'But, if we stop looking for
models and hypotheses, are we still really doing science? Science, unlike
advertizing, is not about finding patterns—although that is certainly part of
the process—it is about finding explanations for those patterns. In fact, it is
easy to argue that Anderson is wrong even about advertizing. ... Without
models, mathematical or conceptual, data are just noise.
...
Anderson goes on to
propose a positive example of the new science he envisions: molecular biology
done à la Craig
Venter, the entrepreneur scientist. According to Anderson, “Venter has advanced
biology more than anyone else of his generation,” and has done so, among other
things, by conducting high throughput searches of genomes in the ocean. In
fact, Venter has simply collected buckets of water, filtered the material and
put the organic content through his high-speed genomic sequencing machines. The
results are interesting, including the discovery that there are thousands of
previously unknown bacterial species. But, as Anderson points out, “Venter can
tell you almost nothing about the species he found. He doesn't know what they
look like, how they live, or much of anything else about their morphology. He
doesn't even have their entire genome. All he has is a statistical blip—a
unique sequence that, being unlike any other sequence in the database, must
represent a new species.” Which means that Venter has succeeded in generating a
large amount of data—in response to a specific question, by the way: how many
distinct, species-level genome sequences can be found in the oceans? This will
surely provide plenty of food for thought for scientists, and a variety of ways
to test interesting hypotheses about the structure of the biosphere, the
diversity of bacterial life, and so on. But, without those hypotheses to be
tested, Venter's data are going to be a useless curiosity, far from being the
most important contribution to science in this generation.
... science advances only if it can provide explanations,
failing which, it becomes an activity more akin to stamp collecting. Now, there
is an area where petabytes of information can be used for their own sake. But
please don't call it science.'