Thursday, December 11, 2014

Big data: small guys could do it?


After reading through quite a bit of discussions, tutorials, reports, blogs, primers, Q&A's, proceedings, popular articles, and Wiki pages about big data, and having become convinced that it could do really good things for development, I felt I needed some hands-on experience with it. Now, the question is: could a small guy with a moderately powerful laptop, some knowledge of R, and a slow internet connection do it? After all, most of what I have read seems to carry a "Don't try this at home" kind of warning. One source says working with big data requires "massively parallel software running on tens, hundreds, or even thousands of servers".

Really large data, in the terabyte range or beyond, is usually handled by the Apache Hadoop software. Wikipedia describes Hadoop as "an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware". The underlying idea is to "split" the data into manageable pieces, do your calculations on each piece separately and at the same time ("apply"), and then "combine" the results.
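To see the idea at a toy scale, here is a minimal sketch of the same split-apply-combine pattern in plain R on a single machine (the data is simulated; Hadoop does the same thing, but distributed across many servers):

# split: break the data into manageable pieces
set.seed(1)
x <- data.frame(group = sample(letters[1:4], 1e6, replace = TRUE),
                value = rnorm(1e6))
chunks <- split(x$value, x$group)

# apply: compute a partial result on each piece, independently
partial <- lapply(chunks, function(v) c(sum = sum(v), n = length(v)))

# combine: merge the partial results into the final answer
totals <- do.call(rbind, partial)
sum(totals[, "sum"]) / sum(totals[, "n"])   # overall mean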

I read the Wiki page on big data, which said: "Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization". My principal interest is analysis, and perhaps visualization as part of analysis. So I looked for examples of analyzing data too big to handle on one laptop using open source statistical software like R.

At the same time I was aware that, although R is an excellent statistical environment, its limitation is that it normally needs to hold all the data it processes entirely in the computer's memory. According to Kane et al. (Scalable Strategies for Computing with Massive Data, Journal of Statistical Software, November 2013, Volume 55, Issue 14), for use with R a data set should be considered large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.
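Their rule of thumb is simple enough to put into a throwaway function (the function name and arguments here are mine, just for illustration):

# Kane et al. (2013) rule of thumb: relative to the machine's RAM,
# a data set is "large" above 20% and "massive" above 50%.
classify_data_size <- function(data_gb, ram_gb) {
  ratio <- data_gb / ram_gb
  if (ratio > 0.5) "massive"
  else if (ratio > 0.2) "large"
  else "fits comfortably"
}
classify_data_size(data_gb = 12, ram_gb = 8)   # the census extract: "massive"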

Then I found the article "bigglm on your big data set in open source R, it just works – similar as in SAS" at http://kadimbilgi.blogspot.com/2012/11/bigglm-on-your-big-data-set-in-open.html. The author (Bilgi) wrote:

"In a recent post by Revolution Analytics (link & link) in which Revolution was benchmarking their closed source generalized linear model approach with SAS, Hadoop and open source R, they seemed to be pointing out that there is no 'easy' R open source solution which exists for building a poisson regression model on large datasets.
This post is about showing that fitting a generalized linear model to large data [is] easy in open source R and just works".

As you may know, Revolution Analytics is the company that sells a commercial version of R. Inspired by Bilgi, I set out to learn the R package "ff" and at the same time tried to get some large enough data to experiment with. Shortly after discovering this article I was lucky enough to be visiting Singapore, where I was able to download large data files. I was thinking of getting data files of up to about 1 terabyte, so I bought a 4 TB hard disk from Amazon.

I was then able to download a number of large data sets, but none came close to 1 TB. The largest was the American Statistical Association's 2009 Data Expo data set, which contains flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008: about 120 million records on 29 variables. The compressed file was about 1.7 GB and expanded to about 12 GB. The data set I actually used for exploring the analysis of large data was the 5 percent sample of the US population census, available from IPUMS-USA at https://usa.ipums.org/usa/. It contains about 5.7 million household records and about 14 million person records. I am somewhat familiar with household surveys, and that was the main reason for choosing this data set. The other reason was that our own census was just a few months away, so by learning how to analyze census data now we would be ready to do our own research if we could get similar data from our own census later.
"The Integrated Public Use Microdata Series (IPUMS) consists of over sixty high-precision samples of the American population drawn from fifteen federal censuses, from the American Community Surveys of 2000-2012, and from the Puerto Rican Community Surveys of 2005-2012. Some of these samples have existed for years, and others were created specifically for this database".

Unfortunately, IPUMS-International has no census data on Myanmar, though it includes countries from Africa, Asia, Europe, and Latin America from 1960 forward. The database currently includes 159 samples from 55 countries around the world.

Having downloaded the US census 5 percent data, I had to figure out how to go about analyzing it with R, my chosen software. My laptop has 8 GB of RAM and an Intel i5 processor, and runs Windows 7. As noted above, R works comfortably only when the data takes up no more than about 20% of available RAM; on my machine that is about 1.6 GB, while the data set is about 12 GB in size.

Technically speaking, R's problem with handling big data sets involves two aspects: memory limitations and addressing limitations. These can be handled through a technique called memory mapping. In the CRAN Task View "High-Performance and Parallel Computing with R" (see my post 'An Unclaimed CD on Psychometrics with R or Intro to Anything with R'), under the topic 'Large memory and out-of-memory data', you can find a short description of the R package 'ff', which makes data stored on disk behave as if it were in RAM, and of the ffbase package, which adds basic statistical functionality to ff. A good, though rather technical, account of the ff package is given in this presentation: http://user2007.org/program/presentations/adler.pdf.
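Loading a big text file into ff's on-disk format is straightforward once you know the function. The sketch below is not my exact code (the file name is a placeholder for the downloaded IPUMS extract), but it shows the shape of it:

library(ff)
library(ffbase)

# read the CSV in chunks and store the columns as files on disk
census <- read.csv.ffdf(file = "usa_5pct.csv",   # placeholder file name
                        header = TRUE,
                        first.rows = 10000,      # rows used to guess column types
                        next.rows = 50000,       # chunk size for later reads
                        VERBOSE = TRUE)

dim(census)     # dimensions, without pulling the data into RAM
names(census)   # column names; the data itself stays on disk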

Following Bilgi's example in his article "bigglm on your big data set in open source R, it just works – similar as in SAS", I used the ff and ffbase packages to load and manipulate the US Census 5 percent data set, and then used the biglm package to fit a linear model and a generalized linear model to it successfully (a sketch of the modelling step follows the benchmarks below). Here are some of the benchmarks:

       Importing US Census 5 percent data set into ff format: 11.9 minutes.
       Extract household level information in ff format: 7.8 minutes.
       Removing households with missing values and reformatting data: 28.56 seconds.
       Running generalized linear model with biglm package on 5,273,998 households: 40.26 seconds.
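Roughly, the modelling step looks like the sketch below. The variable names are illustrative placeholders (the real IPUMS variables have different names and codings), and the household-level ffdf is assumed to have been prepared already:

library(ff)
library(ffbase)   # provides a bigglm method for ffdf objects
library(biglm)

# logistic model of home ownership on characteristics of the household head
fit <- bigglm(ownership ~ age + sex + hhsize,
              data = households,       # ffdf: rows are streamed in chunks
              family = binomial(),     # owned vs rented is a binary outcome
              chunksize = 100000)      # rows processed per pass
summary(fit)                           # coefficients and standard errors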


The result is:

Here, age is the age of the household head (in single years), sex is the sex of the head (male/female), household size is the number of persons in the household, and ownership is the tenure of the dwelling (owned or being bought / rented).

In his article, Bilgi first worked on about 2.8 million records and then exploded the data by a factor of 100 to create 280 million records, which he analyzed with the same procedure. I didn't follow that part of his example because it might have taken an hour or two to complete, but I am confident it could be done.

I also ran the same analysis on the same household records using a different approach to big data: letting a different kind of database management system, a column-oriented database, feed the data required for the analysis. Standard relational databases handle data by rows and are not well suited to the scanning and aggregation that big data analysis requires, whereas column stores are built for it. I used the open source MonetDB database software (not an R package) together with the MonetDB.R package to connect to MonetDB from R. This approach has since been improved: starting with the Oct2014 release, MonetDB ships with a feature called R-Integration. I have yet to download this new version of MonetDB, learn it, and try it out.
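Once the census tables are loaded into MonetDB, the R side is just standard DBI calls. The sketch below uses placeholder connection details and table/column names for a locally running server:

library(DBI)
library(MonetDB.R)

con <- dbConnect(MonetDB.R(), host = "localhost", dbname = "census",
                 user = "monetdb", password = "monetdb")

# the filtering and aggregation happen inside the column store;
# only the small result set comes back to R
dbGetQuery(con,
  "SELECT ownership, COUNT(*) AS households, AVG(hhsize) AS mean_size
     FROM households
    GROUP BY ownership")

dbDisconnect(con)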

I learned to use MonetDB for processing big data with R by following Anthony Damico's examples of processing sixty-seven million physician visit records available at: http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html

There you will find the link to download the MonetDB software, as well as an explanation of why and how to install MonetDB with R. There are also links to a good list of public-use data sets that you can download, including those that appear in his code examples.

You may also want to visit the official MonetDB website at: https://www.monetdb.org/.


