Friday, December 19, 2014

Big data: problems of correlation, bias, and machine learning


Is correlation enough?


Anderson said:
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. ... There is now a better way. Petabytes allow us to say: "Correlation is enough."

The above graph shows a statistically significant correlation between chocolate consumption per capita and the number of Nobel laureates in a country. Would a country then increase its number of Nobel laureates by increasing its chocolate consumption? Of course not; correlation does not imply causation. 'For example, in Italian cities the number of churches and the number of homicides per year are proportional to the population, which of course does not mean that an increase in the number of churches corresponds to an increase in the number of homicides, or vice versa!' (Big Data, Complexity and Scientific Method, http://www.syloslabini.info/online/big-data-complexity-and-scientific-method/).

It is possible to find any number of such "spurious" correlations. A good site is: http://www.tylervigen.com/.

However, correlation is not entirely worthless, as Edward Tufte (Correlation does not imply causation, Wikipedia) clarifies:
        "Empirically observed covariation is a necessary but not sufficient condition for causality."
        "Correlation is not causation but it sure is a hint."

Prompted by the White House’s Big Data Report and the PCAST Report, the US National Telecommunications and Information Administration requested public comment on big data and consumer privacy in the Internet economy. The Electronic Frontier Foundation's comment of August 2014 "focused on one main point: that policymakers should be careful and skeptical about claims made for the value of big data, because over-hyping its benefits will likely harm individuals’ privacy." (http://www.ntia.doc.gov/files/ntia/eff.pdf)

The EFF emphasized that big data analysis can be accurate and effective only if the data collection and analysis are done carefully and purposefully, and that for big data analysis to be valid, one must follow rigorous statistical practices.

Simply “collecting it all” and then trying to extract useful information from the data by finding correlations is likely to lead to incorrect (and, depending on the particular application, harmful or even dangerous) results.

The reason is that three big data analysis problems need to be addressed before any trade-offs with privacy can be explored:

Problem 1: Sampling Bias

... “that ‘N = All’, and therefore that sampling bias does not matter, is simply not true in most cases that count.” On the contrary, big data sets “are so messy, it can be hard to figure out what biases lurk inside them – and because they are so large, some analysts seem to have decided the sampling problem isn’t worth worrying about. It is.”

Correcting for sampling bias is especially important given the digital divide. By assuming that data generated by people’s interactions with devices, apps, and websites are representative of the population as a whole, policy-makers risk unintentionally redlining large parts of the population. Simply put, “with every big data set, we need to ask which people are excluded. Which places are less visible? What happens if you live in the shadow of big data sets?”

... Simply taking a data set and throwing some statistical or machine learning algorithms at it and assuming “the numbers will speak for themselves” is not only insufficient—it can lead to fundamentally flawed results.
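As a rough illustration of why biased "found" data cannot simply speak for itself, here is a minimal simulation in R (my own sketch, not from the EFF comment; all numbers are invented). A data set that over-represents higher-income households, as app-generated data on the wrong side of the digital divide would, misestimates the population average badly, while a small random sample does not:

# Minimal sketch of sampling bias (illustrative only; every number here is made up).
# Goal: estimate the average household income of a population of one million.
set.seed(1)
income <- rlnorm(1e6, meanlog = 7, sdlog = 0.6)          # hypothetical income distribution

# Probability of appearing in the "found" data rises with income (digital divide).
p_app <- plogis((income - median(income)) / 5000)
in_app_data <- runif(1e6) < p_app

mean(income)                  # true population mean
mean(income[in_app_data])     # estimate from the big but biased "found" data
mean(sample(income, 1000))    # estimate from a small random sample of 1,000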

Problem 2: Correlation is Not Causation (And Sometimes, Correlation is Not Correlation)

Even if one tackles the sampling problem, a fundamental problem with big data is that “although big data is very good at detecting correlations…it never tells us which correlations are meaningful. ...

Even more problematic, however, is the fact that “big data may mean more information, but it also means more false information.” This contributes to what is known as the “multiple-comparisons” problem: if you have a large enough data set, and you do enough comparisons between different variables in the data set, some comparisons that are in fact flukes will appear to be statistically significant.
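The multiple-comparisons problem is easy to reproduce. The following R sketch (my own illustration, with arbitrary sizes) correlates 200 columns of pure noise against each other; even though no relationship is real, roughly five percent of the nearly 20,000 pairwise tests come out "statistically significant" at the usual 5% level:

# Multiple comparisons on pure noise: some "significant" correlations are guaranteed.
set.seed(42)
noise <- matrix(rnorm(500 * 200), nrow = 500, ncol = 200)   # 500 observations, 200 variables

n_pairs  <- choose(200, 2)                                  # 19,900 pairwise comparisons
p_values <- numeric(n_pairs)
k <- 1
for (i in 1:199) {
  for (j in (i + 1):200) {
    p_values[k] <- cor.test(noise[, i], noise[, j])$p.value
    k <- k + 1
  }
}

sum(p_values < 0.05)    # about 1,000 "significant" correlations that are pure flukes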

Problem 3: Fundamental Limitations of Machine Learning

Many computer scientists would argue that one way to combat false correlations is to use more advanced algorithms, such as those involved in machine learning. But even machine learning suffers from some fundamental limitations.

First and foremost, “getting machine learning to work well can be more of an art than a science.”

Second, machine-learning algorithms are just as susceptible to sampling biases as regular statistical techniques, if not more so. The failure of the Google Flu Trends experiment is a prime example of this: machine-learning algorithms are only as good as the data they learn from.  If the underlying data changes, then the machine-learning algorithm cannot be expected to continue functioning correctly. ...

Additionally, many machine-learning techniques are fragile: if their input data is perturbed ever so slightly, the results will change significantly. ... Finally, machine learning, especially model-free learning, is not a valid replacement for more careful statistical analysis (or even machine learning using a model).
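A toy example of the data-drift problem (my own sketch, not Google's actual system) makes the point concrete: a model learns a tidy relationship between search volume and flu cases, search behavior then changes (say, because of media-driven searches by healthy people), and the model's error explodes even though nothing about the algorithm changed:

# Toy sketch of data drift: the model is fine on the data it learned from,
# but fails once the behavior generating the data changes.
set.seed(7)
searches_2009 <- runif(200, 0, 100)
cases_2009    <- 5 * searches_2009 + rnorm(200, sd = 20)
model <- lm(cases ~ searches,
            data = data.frame(searches = searches_2009, cases = cases_2009))

# Later period: searches are inflated by news coverage, actual cases are not.
searches_2012 <- 1.8 * runif(200, 0, 100)
cases_2012    <- 5 * (searches_2012 / 1.8) + rnorm(200, sd = 20)
pred_2012     <- predict(model, newdata = data.frame(searches = searches_2012))

mean(abs(predict(model) - cases_2009))   # in-sample error: modest
mean(abs(pred_2012 - cases_2012))        # error after the behavior shift: an order of magnitude larger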

The EFF concluded that only one particular type of big data analysis can genuinely use big data to answer difficult questions and come up with new ways of helping society as a whole, namely:

... analysis that attempts to learn a trend or correlation about a population as a whole (e.g. to identify links between symptoms and a disease, to identify traffic patterns to enable better urban planning, etc.).

According to the EFF, other uses of big data cannot escape the technical problems described above:

Other uses of big data by their very nature cannot overcome these technical obstacles. Consider the idea of targeting individuals on a massive scale based on information about them collected for a secondary purpose. By using “found” data that was not intended for the specific use it is being put to, sampling biases are inevitable (i.e. Problem 1).

Or consider the claim by proponents of big data that by “collecting it all” and then storing it indefinitely, they can use the data to learn something new at some distant point in the future. Not only will such a “discovery” likely be subject to sampling biases, but any correlations that are discovered in the data (as opposed to being explicitly tested for) are likely to be spurious (i.e. Problem 2).

At the same time, these sorts of uses (individualized targeting, secondary use of data, indefinite data retention, etc.) pose the greatest privacy threats, since they involve using data for purposes for which consent was not originally given and keeping it longer than otherwise necessary.

Saturday, December 13, 2014

Big data: End of Theory and Advantage of Late-Entrants


Almost two decades after Fukuyama's 'The End of History?' was published in 1989, there came 'The End of Theory: The Data Deluge Makes the Scientific Method Obsolete' by Chris Anderson, editor-in-chief of Wired magazine, posted on Wired's website in June 2008. Here I'm simply using the end of history as a reference point to mark the time at which the declaration of the end of theory was made, and also because the two titles sound alike and both made sensational news.

Advantage of late-entrants?

If late entrants really do have such an advantage, that would be the single greatest promise to us. In conjunction with it, if the End of Theory is correct, it means we won't need to accumulate and distill the best of past knowledge as building blocks for present and future knowledge. As Anderson put it:
                                                                                             
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. ... But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. ... There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

Then there is this story of the two young men who went to learn to play saung (Myanmar harp) from the master. One didn't know anything about harps. The other proudly declared he had learnt to play a bit on his own, upon which the master said "You would pay twice the fee of the other one because with you I will have to make you unlearn what you have learnt wrongly on your own".

So, putting all these together, the solution to our problem of catching up with others seems obvious and simple. Big data would penalize those who are ahead of us in accumulating knowledge, theories and such. We are actually lucky in that our people are lagging behind in learning to do the sciences or most of anything. Now we don't need to waste time making our people unlearn any sciences they have learnt, as most others would need to. Just make our people learn big data technologies, data science and information technology, install enough large computers and enough sensing devices, and it will be done.

Unfortunately, this rosy scenario would never be. Now, years after Anderson's dismissal of theory, people have come to realize (some, or most of them, immediately after his assertion) that they would still need to learn their models, hypotheses, and theories in doing their traditional sciences and also learn to do data science and big data.

Many today believe that big data is mostly hype. Worse, we may be "making a big mistake", according to Tim Harford in his article in the Financial Times of March 28, 2014 (http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-0144feabdc0.html#axzz3LaONI6sM), which pointed out that:

'Five years ago ... (Google was) able to track the spread of influenza across the US. ... could do it more quickly than the Centers for Disease Control and Prevention (CDC) ... tracking had only a day’s delay, compared with the week or more it took for the CDC ... based on reports from doctors’ surgeries ... was faster because it was tracking the outbreak by finding a correlation between what people searched for online and whether they had flu symptoms ... (it was) quick, accurate and cheap, it was theory-free. ... The Google team just took their top 50 million search terms and let the algorithms do the work ... excited journalists asked, can science learn from Google?'

Such successes gave rise to "four articles of faith":

         'that data analysis produces uncannily accurate results'
         'that every single data point can be captured, making old statistical sampling techniques obsolete'
         'that ... statistical correlation tells us what we need to know'
         'that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.'

But four years after Google Flu's success story, the sad news was that "Google's estimates of the spread of flu-like illnesses were overstated by almost a factor of two". The dominant idea of looking for patterns, giving primacy to correlation over causation, was the culprit for this failure.

'Google’s engineers weren’t trying to figure out what caused what. They were merely finding statistical patterns in the data. They cared about correlation rather than causation. This is common in big data analysis'.

And Google's algorithms have no way of knowing whether people's behavior changed along with their internet searches, or whether they were catching "spurious associations" long recognized by statisticians. In this joke mentioned by BMR of Apr 3, 2014 in the comments to Harford's article, you just need to change "misuses of econometrics" to "misuses of big data" to make it current: ... do you remember the decades-old joke about the misuses of econometrics; "If you torture the data hard enough they will confess!"

But a theory-free analysis of mere correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. One explanation of the Flu Trends failure is that the news was full of scary stories about flu in December 2012 and that these stories provoked internet searches by people who were healthy.

Another possible explanation is that Google’s own search algorithm moved the goalposts when it began automatically suggesting diagnoses when people entered medical symptoms.

He also pointed out the historical lesson of a poorly administered large sample in forecasting the Roosevelt-Landon presidential election of 1936. The Literary Digest conducted a postal opinion poll aiming to reach 10 million people, a quarter of the electorate. After tabulating 2.4 million returns it predicted that Landon would win by a convincing 55 per cent to 41 per cent. But the actual result was that Roosevelt crushed Landon by 61 per cent to 37 per cent. In contrast, a small survey of 3,000 interviews conducted by the opinion poll pioneer George Gallup came much closer to the final vote, forecasting a comfortable victory for Roosevelt. Lesson: "When it comes to data, size isn’t everything".

In answering the obvious question "But if 3,000 interviews were good, why weren’t 2.4 million far better?", statisticians would answer "Mind your bias in bigger samples". In the Literary Digest's case:

'It mailed out forms to people on a list it had compiled from automobile registrations and telephone directories – a sample that, at least in 1936, was disproportionately prosperous. To compound the problem, Landon supporters turned out to be more likely to mail back their answers.' 

The result was a biased sample giving a result skewed in favor of Landon, as opposed to a sample reflecting the whole population without bias. Statisticians have long been aware of this and consciously try to avoid biased samples. When you take a sample and do your research you are faced with two main sources of error: sampling errors and biases. Statisticians say you can handle the first by "doing more of something", that is, taking a larger sample; and the second by "doing something more", that is, looking for sources of bias and trying to eliminate them, which generally means trying to make your sample more representative of the whole population. One big problem with big data is the assumption that N = All, or big data = all data, which was clearly implied when Anderson said "With enough data, the numbers speak for themselves."
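The 1936 lesson is easy to reproduce in a simulation. The sketch below (my own, with invented probabilities) builds a huge poll whose sampling frame and response rates favor Landon supporters, and compares it with a 3,000-person random sample:

# Literary Digest in miniature: a huge biased poll versus a small random one.
# True support is set at 61% Roosevelt, 37% Landon, 2% other; the bias figures are invented.
set.seed(1936)
electorate <- sample(c("Roosevelt", "Landon", "Other"), 1e6,
                     replace = TRUE, prob = c(0.61, 0.37, 0.02))

# Car/telephone owners (the mailing list) are assumed to lean toward Landon,
# and Landon supporters are assumed more likely to mail the form back.
in_frame  <- ifelse(electorate == "Landon", runif(1e6) < 0.35, runif(1e6) < 0.20)
responded <- in_frame & ifelse(electorate == "Landon", runif(1e6) < 0.80, runif(1e6) < 0.50)

prop.table(table(electorate[responded]))      # the big mail-in poll: Landon "wins"
prop.table(table(sample(electorate, 3000)))   # the small random sample: Roosevelt wins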

'An example is Twitter. It is in principle possible to record and analyse every message on Twitter and use it to draw conclusions about the public mood. (In practice, most researchers use a subset of that vast “fire hose” of data.) But while we can look at all the tweets, Twitter users are not representative of the population as a whole. (According to the Pew Research Internet Project, in 2013, US-based Twitter users were disproportionately young, urban or suburban, and black.)

Consider Boston’s Street Bump smartphone app, which uses a phone’s accelerometer to detect potholes without the need for city workers to patrol the streets. As citizens of Boston download the app and drive around, their phones automatically notify City Hall of the need to repair the road surface. Solving the technical challenges involved has produced, rather beautifully, an informative data exhaust that addresses a problem in a way that would have been inconceivable a few years ago. The City of Boston proudly proclaims that the “data provides the City with real-time in­formation it uses to fix problems and plan long term investments.”

Yet what Street Bump really produces, left to its own devices, is a map of potholes that systematically favours young, affluent areas where more people own smartphones. ... That is not the same thing as recording every pothole.'

There was the story of the US discount department store Target reported in The New York Times in 2012. According to the report, one man stormed into a Target and complained to the manager that the company was sending coupons for baby clothes and maternity wear to his teenage daughter. The manager apologised. But later the father found that the teenager was indeed pregnant. Target had found it out by analysing her purchases of unscented wipes and magnesium supplements. It was actually a false-positive issue: the countless cases of women who received coupons for babywear but were not pregnant were never reported or taken into account. Indeed, it could be that pregnant women receive such offers merely because everybody on Target’s mailing list receives such offers!

Yet there is another problem that really threatens big data precisely because it is big. It is the multiple-comparisons problem, which arises when a researcher looks at many possible patterns in the data. As Harford puts it:

'There are various ways to deal with this but the problem is more serious in large data sets, because there are vastly more possible comparisons than there are data points to compare. Without careful analysis, the ratio of genuine patterns to spurious patterns – of signal to noise – quickly tends to zero.

Worse still, one of the antidotes to the multiple-comparisons problem is transparency, allowing other researchers to figure out how many hypotheses were tested and how many contrary results are languishing in desk drawers because they just didn’t seem interesting enough to publish. Yet found data sets are rarely transparent. Amazon and Google, Facebook and Twitter, Target and Tesco – these companies aren’t about to share their data with you or anyone else.'

Among the comments to Harford's article is this amusing one by Angoisse of April 8, 2014. It tries to discredit the scientific method as being based on an outdated model from the Age of Enlightenment:

'A scientist once taught a spider to jump to the sound of a bell. He plucked a leg and the spider jumped. He removed a second leg and the spider jumped. He removed a third leg, same result. And so on until the spider had no more legs. He rang the bell but the spider did not jump. The scientist thus claimed he had proven that spiders hear through their legs.

Stupid story, maybe... but it points out that hypothesis might be incorrect even though the relationship of events remain. Our way of conceiving and proving theory is based on an outdated model from the Age of Enlightenment. Big Data is coming to challenge our mindsets; lets not vilify it by understanding our natural fear of change, particularly in academia, but also let's not adore it as the new Golden Calf. It is just a new tool at our disposal.'

It's a lousy experiment because in the end you cannot separate the loss of the ability to jump from the loss of the ability to hear. It's a nice joke, but you can't discredit the scientific method that easily. Actually the lesson of the story cuts the other way: jumping and bell-ringing are clearly correlated, so a data scientist à la Anderson would invariably conclude that spiders hear through their legs without much ado!

I started out with Anderson, who proclaimed the end of theory and the death of the scientific method. I brought in Harford, who was reacting against the methods of big data, particularly the kind he calls found data: 'the “big data” that interests many companies is what we might call “found data”, the digital exhaust of web searches, credit card payments and mobiles pinging the nearest phone mast. Google Flu Trends was built on found data and it’s this sort of data that interests me here.'

But there were those who criticized Anderson's idea on the methodology of science, such as Massimo Pigliucci, a scientist with doctorates in genetics, botany, and philosophy of science. In his article "End of Theory in Science?" in EMBO Reports (the journal of the European Molecular Biology Organization), June 2009 (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2711825/), Pigliucci said:

'But, if we stop looking for models and hypotheses, are we still really doing science? Science, unlike advertizing, is not about finding patterns—although that is certainly part of the process—it is about finding explanations for those patterns. In fact, it is easy to argue that Anderson is wrong even about advertizing. ... Without models, mathematical or conceptual, data are just noise.
...
Anderson goes on to propose a positive example of the new science he envisions: molecular biology done a la Craig Venter, the entrepreneur scientist. According to Anderson, “Venter has advanced biology more than anyone else of his generation,” and has done so, among other things, by conducting high throughput searches of genomes in the ocean. In fact, Venter has simply collected buckets of water, filtered the material and put the organic content through his high-speed genomic sequencing machines. The results are interesting, including the discovery that there are thousands of previously unknown bacterial species. But, as Anderson points out, “Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip—a unique sequence that, being unlike any other sequence in the database, must represent a new species.” Which means that Venter has succeeded in generating a large amount of data—in response to a specific question, by the way: how many distinct, species-level genome sequences can be found in the oceans? This will surely provide plenty of food for thought for scientists, and a variety of ways to test interesting hypotheses about the structure of the biosphere, the diversity of bacterial life, and so on. But, without those hypotheses to be tested, Venter's data are going to be a useless curiosity, far from being the most important contribution to science in this generation.
... science advances only if it can provide explanations, failing which, it becomes an activity more akin to stamp collecting. Now, there is an area where petabytes of information can be used for their own sake. But please don't call it science.'






Thursday, December 11, 2014

Big data: small guys could do it?


After reading through quite a bit of discussions, tutorials, reports, blogs, primers, Q&As, proceedings, popular articles, and Wiki pages about big data, and convinced that it could do really good things for development, I felt I needed to get some hands-on experience with it. Now, the question is: could some small guy with a moderately powerful laptop, some knowledge of R, and a slow internet connection do it? After all, most of what I have read seems to give a "Don't try this at home" kind of warning. One source says working with big data requires "massively parallel software running on tens, hundreds, or even thousands of servers".

And really large data, in the terabyte range or larger, is usually handled with the Apache Hadoop software. Wikipedia describes Hadoop as "an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware". The underlying idea is to "split" the data into manageable pieces, do your calculations on the pieces separately and at the same time ("apply"), and then "combine" the results.
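To make the split/apply/combine idea concrete, here is a toy sketch in R (my own illustration; Hadoop does this across many machines, while this fakes it with a couple of local worker processes):

# Split / apply / combine on one machine, mimicking the MapReduce idea.
library(parallel)

big_vector <- rnorm(1e7)                                       # stand-in for a large data set

chunks <- split(big_vector, cut(seq_along(big_vector), 8))     # "split" into 8 pieces
cl <- makeCluster(2)                                           # two local workers
partial <- parLapply(cl, chunks, function(x) c(sum = sum(x), n = length(x)))   # "apply"
stopCluster(cl)
totals <- Reduce(`+`, partial)                                 # "combine" the partial results

totals["sum"] / totals["n"]   # overall mean computed piecewise
mean(big_vector)              # direct answer, for checking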

I read the Wiki page on big data and it said: "Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization". My interest principally is analysis and perhaps visualization as a part of analysis. So I looked for examples of analysis of data too big to handle with one laptop using open source statistical software like R.

At the same time I was aware of the fact that though R is an excellent statistical environment, its limitation is that it needs to hold all the data it processes entirely in the computer's memory. According to Kane et al (Scalable Strategies for Computing with Massive Data, Journal of Statistical Software, November 2013, Volume 55, Issue 14), for use with R a data set should be considered large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.

Then I found the article "bigglm on your big data set in open source R, it just works – similar as in SAS" at http://kadimbilgi.blogspot.com/2012/11/bigglm-on-your-big-data-set-in-open.html. The author (Bilgi) said

"In a recent post by Revolution Analytics (link & link) in which Revolution was benchmarking their closed source generalized linear model approach with SAS, Hadoop and open source R, they seemed to be pointing out that there is no 'easy' R open source solution which exists for building a poisson regression model on large datasets.
This post is about showing that fitting a generalized linear model to large data in R easy in open source R and just works".

As you may know Revolution Analytics is the company that sells the commercial version of R. Inspired by Bilgi, I set out to learn the R package "ff" and at the same time tried to get some large enough data to experiment with. Shortly after discovering this article, I was lucky to be visiting Singapore and so was able to download large data files. I was thinking about getting large data files of about 1 terabyte so I bought one 4 TB hard disk from Amazon.

Then I was able to download a number of large data sets, but none was close to 1 TB. The largest one was the American Statistical Association's 2009 Data Expo data set involving the flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008 containing about 120 million records for 29 variables. The compressed file was about 1.7 GB and expanded to about 12 GB in size. The actual data set I used for exploring the analysis of large data sets was the 5 percent sample of population census of US available from IPUMS-USA at: https://usa.ipums.org/usa/. It contains about 5.7 million household records and about 14 million person records. I am a bit familiar with household surveys and that was the main reason behind choosing this data set. The other reason was that as our own census was just a few months away we could learn how to analyze census data so that if we could get similar data from our own census later, we would be ready to do our own research.
"The Integrated Public Use Microdata Series (IPUMS) consists of over sixty high-precision samples of the American population drawn from fifteen federal censuses, from the American Community Surveys of 2000-2012, and from the Puerto Rican Community Surveys of 2005-2012. Some of these samples have existed for years, and others were created specifically for this database".

Unfortunately, the IPUMS-International has no census data on Myanmar, though it includes countries from Africa, Asia, Europe, and Latin America for 1960 forward. The database currently includes 159 samples from 55 countries around the world.

Having downloaded the US census 5 percent data, I had to figure out how I would go about analyzing it with R, my chosen software. My laptop has 8 GB of RAM and an Intel i5 processor, and runs Windows 7. As noted above, R is comfortable with roughly 20% of available RAM, which in my case means somewhere around 1 to 1.6 GB for data, while the data set is about 12 GB in size.

Technically speaking, R's problem with handling big data sets involves two aspects: memory limitations and addressing limitations. These can be handled through a trick called memory mapping. In the CRAN Task View High-Performance and Parallel Computing with R (see my post 'An Unclaimed CD on Psychometrics with R or Intro to Anything with R'), under the topic 'Large memory and out-of-memory data', you can find a short description of the R package 'ff', which makes data stored on disk behave as if it were in RAM, and of the ffbase package, which adds basic statistical functionality to ff. A good, and rather technical, account of the ff package is given in the presentation: http://user2007.org/program/presentations/adler.pdf.
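In essence, an ff object is a file on disk that R addresses in small chunks, so you can create and use objects far larger than RAM. A minimal sketch (the sizes are arbitrary):

# Memory mapping with ff: the vector lives in a file, not in RAM.
library(ff)

x <- ff(vmode = "double", length = 2e8)   # ~1.6 GB of doubles, backed by a disk file
filename(x)                               # the file that actually holds the data
x[1:5] <- 1:5                             # reads and writes touch only small chunks
x[1:5]

The ffbase package then adds familiar operations (subset, merge, table, and so on) on ffdf data frames built the same way.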

Following Bilgi's example in his article "bigglm on your big data set in open source R, it just works – similar as in SAS", I used the ff and ffbase packages to load and manipulate the US Census 5 percent data set, and used the biglm package to fit linear and generalized linear models to it successfully. Here are some of the benchmarks:

       Importing US Census 5 percent data set into ff format: 11.9 minutes.
       Extract household level information in ff format: 7.8 minutes.
       Removing households with missing values and reformatting data: 28.56 seconds.
       Running generalized linear model with biglm package on 5,273,998 households: 40.26 seconds.


The result is the fitted model summary. Here, age is the age of the household head (in single years), sex is the sex (male/female) of the head, household size is the number of persons in the household, and ownership is the tenure of the dwelling (owned or being bought, versus rented).
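For concreteness, the whole workflow looks roughly like the sketch below. The file name and the variable names (age, sex, hhsize, ownership) are stand-ins for the actual IPUMS extract and variable codes, and ownership is assumed to be coded 0/1:

# Sketch of the ff + ffbase + biglm workflow described above (names are hypothetical).
library(ff)
library(ffbase)
library(biglm)

# Import the extract into an ffdf: the data stays on disk, only chunks enter RAM.
hh <- read.csv.ffdf(file = "ipums_usa_5pct_households.csv",
                    header = TRUE, next.rows = 500000)

# Drop households with missing values in the variables of interest.
hh <- subset(hh, !is.na(age) & !is.na(sex) & !is.na(hhsize) & !is.na(ownership))

# Logistic regression of home ownership (coded 0/1) on household characteristics,
# fitted chunk by chunk; ffbase supplies the bigglm method for ffdf objects.
fit <- bigglm(ownership ~ age + sex + hhsize,
              data = hh, family = binomial(), chunksize = 500000)
summary(fit)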

In Bilgi's article he first worked on about 2.8 million records and then exploded the data by a factor of 100 to create 280 million records and analyzed that with the same procedure. I didn't follow his example because that may take one or two hours to complete. But I am confident it could be done.

I also did the same analysis on the same 5 million household records using another approach to big data: feeding the analysis from a different kind of database management system, a column-oriented database. Standard relational databases handle data by rows and are not very good for this kind of analytical work on big data. I used the open source MonetDB database software (not an R package), together with the MonetDB.R package to connect to MonetDB from R. They have since improved on this approach: starting with the Oct2014 release, MonetDB ships with a feature called R-Integration. I have yet to download this new version of MonetDB, learn it, and try it out.
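The column-store route looks roughly like this sketch (the table and column names are hypothetical, and the exact connection arguments depend on the MonetDB.R version in use); the point is that the database does the heavy scanning and aggregation, and R only receives the small summary:

# Querying a MonetDB column store from R through DBI (names are illustrative).
library(DBI)
library(MonetDB.R)

con <- dbConnect(MonetDB.R(), host = "localhost", dbname = "census",
                 user = "monetdb", password = "monetdb")

# Aggregation happens inside the database; only the grouped summary comes back to R.
dbGetQuery(con, "
  SELECT ownership, AVG(age) AS mean_age, COUNT(*) AS households
  FROM households
  GROUP BY ownership
")

dbDisconnect(con)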

I learned to use MonetDB for processing big data with R by following Anthony Damico's examples of processing sixty-seven million physician visit records available at: http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html

There you will find the link to download the MonetDB software, as well as why and how to install MonetDB with R. There are also links to a good list of public-use data sets which you can download, including those that appear in his code examples.

You may also want to visit the official MonetDB website at: https://www.monetdb.org/.



Saturday, December 6, 2014

Big data: mobile data for development


When I said in my last post that 'One that will be interesting to the general public, administrators, or researchers, and "must read" for MPT, Ooredoo, and Telenor our mobile communication providers if they haven't done so is the Mobile Data for Development Primer by Global Pulse ...', for the must read part I had in mind the young local staff and not those of managerial or professional levels, or the expatriates. I suppose ordinary folks could never be in their league, at least for the time being.

However, when I looked up just now about Ooredoo and Telenor in Myanmar they seem to have been doing something more than providing telecom services. I found Ooredoo launching Myanmar Maternity application “maymay” and Telenor launching "Telenor Light Houses" (Community Information Centers).

The figure below is from 'Using mobile data for development' report available at:
It is good to see that Telenor has been active in the use of mobile data for epidemic surveillance and disease containment and information strategies.

We have seen that Mobile Phone Network Data for Development published in November 2013 by Global Pulse is a primer on how analysis of Call Detail Records (CDRs) can provide information for humanitarian and development purposes. It is essentially an advocacy document for the utilization of metadata from mobile phone calls and messaging for development. In a nutshell, "the document explains three types of indicators that can be extracted through analysis of CDRs (mobility, social interaction and economic activity), includes a synthesis of several research examples and a summary of privacy protection considerations".

The 'Using mobile data for development' report, produced by the Bill & Melinda Gates Foundation in conjunction with the strategy consulting firm Cartesian and published in May 2014, covered the same themes as the Global Pulse Primer, but in considerably more detail. Additionally, the first two chapters describe (i) the adoption and usage of mobile phones in the developing world, and (ii) what data are captured by mobile data systems and how they could be interpreted and used. These would be most useful to analysts and researchers.

The table of contents of this report gives a good idea of what to expect:

1.  Executive summary
2.  Adoption and Usage of Mobile Phones in Developing Countries
         Mobile Adoption Rates Are High in Developing Countries 
         Rates of Mobile Ownership and Usage Are Equally High among the Poor 
         Usage Patterns in Developing Countries Differ from Those in the Developed World

The report says that people in developing countries use more voice and SMS text as compared to data. However, in Myanmar, because of the lack of alternatives most of us have to use mobile phones for internet connection, so our data usage could be markedly higher than elsewhere. I guess most of us who own a desktop or a laptop use our smartphones as wifi hotspots to connect. Before mobiles and cheap SIM cards became available, I used the slower-than-slow prepaid dial-up connection at home, which was useless for purposes other than reading email. If I went to an Internet cafe, the speed was not much faster. Currently WiMAX and other options are too expensive. While ADSL by MPT is quite good, it is not widely available. I lived in downtown Yangon, and yet when I applied for ADSL I was told that it is not available for the whole of my township because some kind of equipment or facility is not there.

The fact is that if we can't get some reasonably priced broadband option such as cable networks, we will still be in a virtual connectivity drought. Meanwhile, MPT, Ooredoo and Telenor should make their data rates cheaper too, and what better way than letting the market take care of it, for example by promoting cable companies and startups?

For a people denied the advantage of good and cheap mobile communication until recently, we would like to see how we compare with others. Unfortunately we aren't there.

I don't know if there is good data on that for Myanmar. Anyway, from the wiki page 'Demographics of Burma' we take the percentage of the population aged 15 years and over, 72.5%, as the proportion of adults (consistent with figure 1), and using a growth rate of 1.07% per year and the 2014 census total population of 51,419,420 we calculated the 2012 population. With the (quite rough) estimate of 5,400,000 mobile phones for 2012 (wiki page, Telecommunications in Burma) we get an estimated ownership among adults of about 15%.
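Spelled out, the back-of-the-envelope calculation is simply:

# Rough estimate of 2012 mobile ownership among adults in Myanmar,
# using the figures quoted above from the Wikipedia pages.
pop_2014    <- 51419420      # 2014 census total population
growth      <- 0.0107        # assumed annual growth rate
adult_share <- 0.725         # share of population aged 15 and over
phones_2012 <- 5400000       # rough estimate of mobile phones in 2012

pop_2012    <- pop_2014 / (1 + growth)^2
adults_2012 <- pop_2012 * adult_share
phones_2012 / adults_2012    # about 0.15, i.e. roughly 15% ownership among adults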

3.  What Is Captured in Mobile Data Systems? 
         How a Mobile Network Functions 
         Mobile Data Captures a Wide Range of Customer Behaviors 

Under this topic the three most important type of information captured are:
(i) "Location and mobility: Location is tracked passively when users’ phones interact with towers in each cell cite they visit, and actively, each time a user initiates a voice call, SMS, or other transaction."
(ii) "The social network: Calling and SMS patterns create a lens into a person’s social network, including who they communicate with, how long, and how often. Further, in many emerging markets the originator of a phone call pays for the minutes of the call; even understanding who someone calls vs. who calls them can give a sense of social stature among a social network, important for marketers seeking to reach nodal hubs of influencers."
(iii) "Recharge and purchase history: Patterns of recharging minutes and purchases of VAS (value-added services) can give insights into an individual’s economic circumstances and the financial shocks or difficulties they face."

         How Mobile Data Is Captured From User Interactions 
         How Mobile Data Can Determine a Person’s Location 
         Gauging the Availability and Accessibility of Data 
         Understanding and Interpreting Different Kinds of Mobile Data 

There is great potential in that "signal characteristics such as attenuation and signal distortion can be measured to provide an indication of local ecology, rainfall patterns, civil construction, etc. Figure 11 explains how attenuation from radio signals was used to collect a large amount of accurate and timely rainfall data."

4.  Present and Possible Applications 
         Potential Insights from Mobile Data 


  
         Mobile Operators Are Beginning to Exploit Mobile Data 
         Current Uses of Mobile Data in Development Programs 
         Future Opportunities to Leverage Mobile Data 
         Should Mobile Data for Philanthropic Use Be Free? 

"The United Nations Global Pulse has put forward the idea of 'data philanthropy,' where operators would have a duty to share data for certain limited uses when the public good is urgent and clear. Global Pulse argues that these cases actually make business sense ... "

5.  Regulatory Landscape and Data Privacy Considerations 
         Regulatory Regimes Are Becoming Clearer and More Standardized 
6.  Considerations for Data Sharing 
         Sensitivities around Mobile Data Access 
         Commercial Sensitivities around Data Access 
         Public Opinion Plays a Role 
         Approaches to Protecting Data Privacy 
7.  Conclusion

"Mobile data has enormous potential to support development efforts and through this to improve the lives of poor people around the world. ... Mobile data offers larger and more representative samples, in near real-time, and at far lower costs than alternative means of data gathering. Indeed we believe the opportunities to leverage these data sets for development goals are only starting to be explored. ... There is an opportunity to learn from the best examples that have been demonstrated to develop a foundation for broader use of mobile data ... ".

It seems clear that mobile CDR data has enormous potential for development applications. To enrich and accelerate research efforts, third-party and government data sets will have to be combined with the mobile CDR data; those areas can't be neglected and need to develop along with big data. To generate useful CDR data, mobile service penetration into the population and into rural areas has to be sufficiently high, though urban-oriented or some other applications may be feasible earlier. At present, big data requires big hardware, perhaps big brains, and, to round them up, big money. So it will again be the old story: when some opportunity opens up, the handful of people who can take advantage of the situation will take it all. But when you are lagging behind and want to leapfrog, you need more than the critical mass of smart people required just to keep the machines running.


Wednesday, December 3, 2014

Big data, MPT, Ooredoo, and Telenor



I read about big data for the first time in 2013, and that was from the Web. When I talked about it in our small community of moderately curious nuts, I was the only one who had heard of it. Here, it is easy to blame the slow internet connection for missing many of the things happening in the world.

So what is big data and is it important? How important?

Those were exactly the questions I asked myself, and the answers were revealed to me bit by bit as I looked more and more for them. My first encounter with big data was "A business report on BIG DATA GETS PERSONAL" from the MIT Technology Review of 2013, available for download.
 
I quickly read through it and was left with a fear that this big technology would rob me of the little privacy I have. This fear seemed justified as I searched and read on. I found that in developed countries big data is used in private business and every big business is joining the bandwagon; in contrast, big data is not yet known to businesses in the developing world, though governments have noticed it, taken interest in it, and probably been exploring its potential as a tool for intelligence and control.

On second thought I realized they, whoever they are, would be after much bigger fish than me, and I would have the natural protection of safety in numbers. At this stage of my understanding, data-wise, big data means each of us small fry would be assigned some numbers, and lots and lots of us would then be scooped up and shoveled into some kind of analytic machine, the mostly implicit logic of this machine being big data = all data. Then out we come, collectively as neat patterns of predictions that would shape our future consumption behavior, or individually as perfectly shaped pawns.

From what has been expounded in this collection of articles, I wasn't much impressed with what big data could do for the individual, because those uses seem too far removed from us. But I felt that what has been said in The Dictatorship of Data is too real to ignore.

Big data is poised to transform society, from how we diagnose illness to how we educate children, even making it possible for a car to drive itself. Information is emerging as a new economic input, a vital resource. Companies, governments, and even individuals will be measuring and optimizing everything possible.

But there is a dark side. Big data erodes privacy. And when it is used to make predictions about what we are likely to do but haven’t yet done, it threatens freedom as well. Big data also exacerbates a very old problem: relying on the numbers when they are far more fallible than we think. Nothing underscores the consequences of data analysis gone awry more than the story of Robert McNamara.
McNamara was a numbers guy. Appointed the U.S. secretary of defense when tensions in Vietnam rose in the early 1960s, he insisted on getting data on everything he could. Only by applying statistical rigor, he believed, could decision makers understand a complex situation and make the right choices. ...

Among the numbers that came back to him was the “body count.” ... A mere 2 percent of America’s generals considered the body count a valid way to measure progress. “A fake—totally worthless,” wrote one general in his comments. “Often blatant lies,” wrote another. “They were grossly exaggerated by many units primarily because of the incredible interest shown by people like McNamara,” said a third.
The use, abuse, and misuse of data by the U.S. military during the Vietnam war is a troubling lesson about the limitations of information as the world hurls toward the big-data era. The underlying data can be of poor quality. It can be biased. It can be misanalyzed or used misleadingly. And even more damningly, data can fail to capture what it purports to quantify.

Even today, the best of the top executives may not be able to evade the dictatorship of data. For example, at Google:

... To determine the best color for a toolbar on the website ... once ordered staff to test 41 gradations of blue to see which ones people used more. In 2009, Google’s top designer, Douglas Bowman, quit in a huff because he couldn’t stand the constant quantification of everything. “I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. ... "

Episodes like these could easily be dismissed as whims of the rich and powerful toying with ideas, but when those with authority and power become obsessed with the power and promise of big data, it is another matter altogether.

Big data will be a foundation for improving the drugs we take, the way we learn, and the actions of individuals. However, the risk is that its extraordinary powers may lure us to commit the sin of McNamara: to become so fixated on the data, and so obsessed with the power and promise it offers, that we fail to appreciate its inherent ability to mislead.

Is that all big data has to offer humanity? All this seems to be confined to the other half of the digital divide (and, logically, the big data divide): the half with perfect connectivity, smart gadgets, smart homes, and smart people living in an opulent world, a repository of the world's knowledge, though not necessarily of its wisdom, I would timidly add. Gandhi once said that the earth provides enough to satisfy every man’s need but not every man’s greed. We find greed on both sides of the digital divide, and yet it could be more severe on our side because of the lack of a mechanism to check it, or because the existing mechanism malfunctions. Unsatisfied with the realm of big data as I had discovered it, I went looking for its potential in development, and found the big data challenge by Orange.

Orange, the mobile telecommunication provider operating in Africa, offered to make its Call Detail Record data available to participants in its challenge to find the best ways to use this data for development.

The Orange "Data for Development" (D4D) challenge is an open data challenge on anonymous call patterns of Orange's mobile phone users in Ivory Coast. The goal of the challenge is to help address society development questions in novel ways by contributing to the socio-economic development and well-being of the Ivory Coast population. Participants to the challenge are given access to four mobile phone datasets ... The datasets are based on anonymized Call Detail Records (CDR) of phone calls and SMS exchanges between five million of Orange's customers in Ivory Coast between December 1, 2011 and April 28, 2012. The datasets are: (a) antenna-to-antenna traffic on an hourly basis, (b) individual trajectories for 50,000 customers for two week time windows with antenna location information, (c) individual trajectories for 500,000 customers over the entire observation period with sub-prefecture location information, and (d) a sample of communication graphs for 5,000 customers.




The organizers expected 40 or 50 project applications and got 260 instead. The D4D winners were announced in the first week of May 2013. Among the four winners, the one addressing mobility and transport and the one addressing disease containment and information campaigns seem most relevant to the Myanmar situation, and extracts from them are given below by way of introduction. For more information, you may want to follow the links provided.

Best Visualization prize winner: “Exploration and Analysis of Massive Mobile Phone Data: A Layered Visual Analytics Approach”


Best Development prize winner: “AllAboard: a System for Exploring Urban Mobility and Optimizing Public Transport Using Cellphone Data”

With large scale data on mobility patterns, operators can move away from the costly and resource intensive four-step transportation planning processes prevalent in the West, to a more data-centric view, that places the instrumented user at the center of development. In this framework, using mobile phone data to perform transit analysis and optimization represents a new frontier with significant societal impact, especially in developing countries.
AllAboard is a system to optimize the planning of a public transit network using mobile phone data with the goal to improve ridership and user satisfaction.
Mobile phone location data is used to infer origin-destination flows in the city, which are then converted to ridership on the existing transit network.
Sequential travel patterns from individual call location data is used to propose new candidate transit routes. An optimization model evaluates how to improve the existing transit network to increase ridership and user satisfaction, both in terms of travel and wait time.

Best Scientific prize winner: “Analyzing Social Divisions Using Cell Phone Data”


First prize winner: “Exploiting Cellular Data for Disease Containment and Information Campaigns Strategies in Country-Wide Epidemics”

... human mobility is one of the key factors at the basis of the spreading of diseases in a population. Containment strategies are usually devised on movement scenarios based on coarse-grained assumptions. Mobility phone data provide a unique opportunity for building models and defining strategies based on precise information about the movement of people in a region or in a country. Another important aspect is the underlying social structure of a population, which might play a fundamental role in devising information campaigns to promote vaccination and preventive measures, especially in countries with a strong family (or tribal) structure. Among the issues that developing countries are facing today, healthcare is probably the most urgent. The effectiveness of health campaigns is often reduced due to low availability of data, inherent limits in the infrastructure and difficult communication with citizens.
... We present a model that describes how diseases spread across the country by exploiting mobility patterns of people extracted from the available data. Then, we simulate several epidemics scenarios and we evaluate mechanisms to contain the spreading of diseases, based on the information about people mobility and social ties.
If you go on looking, you will find a lot of information on the prospect of using mobile phone data for development. One that will be interesting to the general public, administrators, or researchers, and "must read" for MPT, Ooredoo, and Telenor, our mobile communication providers, if they haven't done so, is the Mobile Data for Development Primer by Global Pulse available at: http://www.unglobalpulse.org/sites/default/files/Mobile%20Data%20for%20Development%20Primer_Oct2013.pdf