Saturday, December 13, 2014

Big data: End of Theory and Advantage of Late-Entrants


Almost two decades after Fukuyama's 'The End of History?' was published in 1989, Chris Anderson, editor-in-chief of Wired magazine, posted 'The End of Theory: The Data Deluge Makes the Scientific Method Obsolete' on Wired's website in June 2008. Here I am simply using the end of history as a reference point to mark the time at which the declaration of the end of theory was made, and also because the two titles sound alike and both made sensational news.


Advantage of late-entrants?

If that is truly so, it would be the single greatest promise to us. In conjunction with it, if the End of Theory is correct, it means we won't need to accumulate and distill the best of past knowledge as building blocks for the knowledge of the present and of the future. As Anderson put it:
                                                                                             
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. ... But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. ... There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

Then there is the story of the two young men who went to learn to play the saung (Myanmar harp) from a master. One didn't know anything about harps. The other proudly declared that he had learnt to play a bit on his own, upon which the master said, "You will pay twice the fee of the other one, because with you I will have to make you unlearn what you have learnt wrongly on your own."

So, putting all these together, the solution to our problem of catching up with others seems obvious and simple. Big data would penalize those who are ahead of us in accumulating knowledge, theories and the like. We are actually lucky that our people are lagging behind in learning to do the sciences, or most anything else. Now we don't need to waste time making our people unlearn any sciences they have learnt, as most others would need to. Just make our people learn big data technologies, data science and information technology, install enough large computers and enough sensing devices, and it will be done.
Unfortunately, this rosy scenario would never be. Now, six years after Anderson's denunciation of theory, people have come to realize (some, or most of them, immediately after his assertion) that they still need to learn their models, hypotheses, and theories to do their traditional sciences, and also to learn to do data science and big data.

Many today believe that big data is mostly hype. Worse, we may be "making a big mistake", according to Tim Harford in his article in the Financial Times of March 28, 2014 (http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-0144feabdc0.html#axzz3LaONI6sM), which pointed out that:

'Five years ago ... (Google was) able to track the spread of influenza across the US. ... could do it more quickly than the Centers for Disease Control and Prevention (CDC) ... tracking had only a day’s delay, compared with the week or more it took for the CDC ... based on reports from doctors’ surgeries ... was faster because it was tracking the outbreak by finding a correlation between what people searched for online and whether they had flu symptoms ... (it was) quick, accurate and cheap, it was theory-free. ... The Google team just took their top 50 million search terms and let the algorithms do the work ... excited journalists asked, can science learn from Google?'

Such successes gave rise to "four articles of faith":

         'that data analysis produces uncannily accurate results'
         'that every single data point can be captured, making old statistical sampling techniques obsolete'
         'that ... statistical correlation tells us what we need to know'
         'that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.'

But four years after Google Flu Trends' success story, the sad news was that "Google's estimates of the spread of flu-like illnesses were overstated by almost a factor of two". The dominant idea of looking for patterns, which gives primacy to correlation over causation, was the culprit in this failure.

'Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. They cared about correlation rather than causation. This is common in big data analysis'.

And Google's algorithms have no way of knowing whether people's behavior with their internet searches has changed, or whether they were catching the "spurious associations" long recognized by statisticians. In the joke mentioned by BMR (Apr 3, 2014) in the comments to Harford's article, you just need to change "misuses of econometrics" to "misuses of big data" to make it current: ... do you remember the decades-old joke about the misuses of econometrics: "If you torture the data hard enough they will confess!"

But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. One explanation of the Flu Trends failure is that the news was full of scary stories about flu in December 2012 and that these stories provoked internet searches by people who were healthy.

Another possible explanation is that Google’s own search algorithm moved the goalposts when it began automatically suggesting diagnoses when people entered medical symptoms.

He also pointed out the historical lesson of a poorly administered large sample in forecasting the Roosevelt-Landon presidential election of 1936. The Literary Digest conducted a postal opinion poll aiming to reach 10 million people, a quarter of the electorate. After tabulating 2.4 million returns, it predicted that Landon would win by a convincing 55 per cent to 41 per cent. The actual result was that Roosevelt crushed Landon by 61 per cent to 37 per cent. In contrast, a small survey of 3,000 interviews conducted by the opinion-poll pioneer George Gallup came much closer to the final vote, forecasting a comfortable victory for Roosevelt. The lesson: "When it comes to data, size isn’t everything".

In answering a typical dummy's question, "But if 3,000 interviews were good, why weren’t 2.4 million far better?", statisticians would answer, "Just mind the bias in your bigger sample". In the Literary Digest's case:

'It mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous. To compound the problem, Landon supporters turned out to be more likely to mail back their answers.' 

The result was a sample biased in favor of Landon, rather than one reflecting the whole population. Statisticians have long been aware of this and consciously try to avoid biased samples. When you take a sample and do your research, you face two main sources of error: sampling error and bias. Statisticians say you can handle the first by "doing more of something", that is, taking a larger sample, and the second by "doing something more", that is, looking for sources of bias and trying to eliminate them, which generally means making your sample more representative of the whole population. One big problem with big data is the assumption that N = All, or that big data equals all data, which was clearly implied when Anderson said "With enough data, the numbers speak for themselves."
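Here is a minimal Python sketch of that distinction (my own illustration, not anything from Harford's article; the electorate, the vote shares and the response bias are all invented). A huge but biased mail-out misses the true result badly, while a small simple random sample comes close, just as the Literary Digest's 2.4 million returns lost to Gallup's 3,000 interviews.

    # Hypothetical illustration: "doing more of something" (a bigger sample)
    # cannot fix bias; "doing something more" (a fair sampling design) can.
    import random

    random.seed(1936)

    # Invented electorate: roughly 61% for candidate R, 37% for candidate L.
    N = 1_000_000
    population = random.choices(["R", "L", "other"], weights=[61, 37, 2], k=N)

    def support_for_R(sample):
        return 100 * sum(v == "R" for v in sample) / len(sample)

    # A huge mail-out whose returns are biased: suppose L supporters are three
    # times as likely to be reached and to mail the form back (standing in for
    # car/telephone ownership plus differential non-response in 1936).
    def returns_form(voter):
        return random.random() < (0.75 if voter == "L" else 0.25)

    mailed = random.sample(population, 400_000)
    returned = [v for v in mailed if returns_form(v)]

    # A small but simple random sample, Gallup-style.
    small_random = random.sample(population, 3_000)

    print(f"True share for R:              {support_for_R(population):.1f}%")
    print(f"Huge biased sample (n={len(returned)}):  {support_for_R(returned):.1f}%")
    print(f"Small random sample (n=3,000): {support_for_R(small_random):.1f}%")

Run it and the biased returns put R far below L, Literary Digest style, while the 3,000-person random sample lands within about a percentage point of the truth. Harford's own examples of the same N = All illusion use found data: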

'An example is Twitter. It is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. (In practice, most researchers use a subset of that vast “fire hose” of data.) But while we can look at all the tweets, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.)

Consider Boston’s Street Bump smartphone app, which uses a phone’s accelerometer to detect potholes without the need for city workers to patrol the streets. As citizens of Boston download the app and drive around, their phones automatically notify City Hall of the need to repair the road surface. Solving the technical challenges involved has produced, rather beautifully, an informative data exhaust that addresses a problem in a way that would have been inconceivable a few years ago. The City of Boston proudly proclaims that the “data provides the City with real-time information it uses to fix problems and plan long term investments.”

Yet what Street Bump really produces, left to its own devices, is a map of potholes that systematically favours young, affluent areas where more people own smartphones. ... That is not the same thing as recording every pothole.'

There was the story of the US discount department store Target, reported in The New York Times in 2012. According to the report, one man stormed into a Target store and complained to the manager that the company was sending coupons for baby clothes and maternity wear to his teenage daughter. The manager apologised. But later the father found that the teenager was indeed pregnant; Target had worked it out from analysing her purchases of unscented wipes and magnesium supplements. It was actually a case of the false-positive issue: the countless cases of women who received coupons for babywear but who weren’t pregnant were never heard of, let alone taken into account. Indeed, it could be that pregnant women receive such offers merely because everybody on Target’s mailing list receives such offers!
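The false-positive point is really a base-rate point. Here is a back-of-the-envelope sketch (every number below is hypothetical, chosen only to show the arithmetic; none of it comes from the Target report): even a fairly accurate predictor, applied to a huge mailing list in which pregnancy is rare, flags far more non-pregnant customers than pregnant ones.

    # Hypothetical base-rate arithmetic (all numbers invented for illustration).
    mailing_list = 1_000_000   # customers scored by the model
    base_rate    = 0.02        # assume 2% of them are actually pregnant
    sensitivity  = 0.80        # assume 80% of pregnant customers get flagged
    false_rate   = 0.05        # assume 5% of non-pregnant customers get flagged

    pregnant     = mailing_list * base_rate
    not_pregnant = mailing_list - pregnant

    true_flags  = pregnant * sensitivity        # 16,000 correctly flagged
    false_flags = not_pregnant * false_rate     # 49,000 wrongly flagged

    precision = true_flags / (true_flags + false_flags)
    print(f"Coupons sent: {true_flags + false_flags:,.0f}")
    print(f"Actually pregnant among them: {precision:.0%}")   # about 25%

The one dramatic hit makes a newspaper story; the tens of thousands of wrongly flagged customers never report back.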

Yet there is another problem that really threatens big data precisely because it is big: the multiple-comparisons problem, which arises when a researcher looks at many possible patterns in the data.

'There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.

Worse still, one of the antidotes to the multiple-comparisons problem is transparency, allowing other researchers to figure out how many hypotheses were tested and how many contrary results are languishing in desk drawers because they just didn’t seem interesting enough to publish. Yet found data sets are rarely transparent. Amazon and Google, Facebook and Twitter, Target and Tesco – these companies aren’t about to share their data with you or anyone else.'
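Harford's signal-to-noise point can be seen in a minimal Python sketch of the multiple-comparisons problem (my own illustration; the sample size, the number of candidate patterns and the 0.35 cut-off are arbitrary). Every number below is pure noise, yet with thousands of candidate patterns some of them look convincingly correlated with the outcome.

    # Multiple comparisons on pure noise: test enough random "features" against
    # a random outcome and some will look impressively correlated anyway.
    import random
    import statistics

    random.seed(0)

    n_points = 50        # observations
    n_features = 5_000   # candidate patterns to test

    outcome = [random.gauss(0, 1) for _ in range(n_points)]

    def corr(xs, ys):
        # Pearson correlation coefficient.
        mx, my = statistics.mean(xs), statistics.mean(ys)
        sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

    strong, best = 0, 0.0
    for _ in range(n_features):
        feature = [random.gauss(0, 1) for _ in range(n_points)]
        r = abs(corr(feature, outcome))
        best = max(best, r)
        if r > 0.35:   # would look "convincing" if it were the only test run
            strong += 1

    print(f"Spurious |r| > 0.35: {strong} of {n_features} random features")
    print(f"Strongest spurious correlation: |r| = {best:.2f}")

The more patterns you test, the more of your "discoveries" are guaranteed to be noise, and without knowing how many comparisons were made, no one can tell which is which.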

Among the comments to Harford's article is this amusing one by Angoisse of April 8, 2014, which tries to discredit the scientific method as being based on an outdated model from the Age of Enlightenment:

'A scientist once taught a spider to jump to the sound of a bell. He plucked a leg and the spider jumped. He removed a second leg and the spider jumped. He removed a third leg, same result. And so on until the spider had no more legs. He rang the bell but the spider did not jump. The scientist thus claimed he had proven that spiders hear through their legs.

Stupid story, maybe... but it points out that hypothesis might be incorrect even though the relationship of events remain. Our way of conceiving and proving theory is based on an outdated model from the Age of Enlightenment. Big Data is coming to challenge our mindsets; lets not vilify it by understanding our natural fear of change, particularly in academia, but also let's not adore it as the new Golden Calf. It is just a new tool at our disposal.'

It's a lousy experiment because, at the end, you cannot separate the effect of removing the legs on jumping from its effect on hearing. It's a nice joke, but you can't discredit the scientific method that easily. Actually, the lesson of the story runs the other way round: losing legs and failing to jump at the bell are clearly correlated, so a data scientist à la Anderson would conclude, without much ado, that spiders hear through their legs!

I started out with Anderson, who proclaimed the end of theory and the death of the scientific method. I brought in Harford, who was reacting against big data methods in the context of found data: 'the "big data" that interests many companies is what we might call "found data", the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast. Google Flu Trends was built on found data and it’s this sort of data that interests me here.'

But there were those who criticized Anderson's idea on the methodology of science, such as Massimo Pigliucci, a scientist with doctorates in genetics, botany, and philosophy of science. In his article 'The end of theory in science?' in EMBO Reports, the journal of the European Molecular Biology Organization, June 2009 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2711825/), Pigliucci said:

'But, if we stop looking for models and hypotheses, are we still really doing science? Science, unlike advertizing, is not about finding patterns—although that is certainly part of the process—it is about finding explanations for those patterns. In fact, it is easy to argue that Anderson is wrong even about advertizing. ... Without models, mathematical or conceptual, data are just noise.
...
Anderson goes on to propose a positive example of the new science he envisions: molecular biology done a la Craig Venter, the entrepreneur scientist. According to Anderson, “Venter has advanced biology more than anyone else of his generation,” and has done so, among other things, by conducting high throughput searches of genomes in the ocean. In fact, Venter has simply collected buckets of water, filtered the material and put the organic content through his high-speed genomic sequencing machines. The results are interesting, including the discovery that there are thousands of previously unknown bacterial species. But, as Anderson points out, “Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip—a unique sequence that, being unlike any other sequence in the database, must represent a new species.” Which means that Venter has succeeded in generating a large amount of data—in response to a specific question, by the way: how many distinct, species-level genome sequences can be found in the oceans? This will surely provide plenty of food for thought for scientists, and a variety of ways to test interesting hypotheses about the structure of the biosphere, the diversity of bacterial life, and so on. But, without those hypotheses to be tested, Venter's data are going to be a useless curiosity, far from being the most important contribution to science in this generation.
... science advances only if it can provide explanations, failing which, it becomes an activity more akin to stamp collecting. Now, there is an area where petabytes of information can be used for their own sake. But please don't call it science.'





