After reading through quite a number of discussions, tutorials, reports, blogs, primers, Q&As, proceedings, popular articles, and wiki pages about big data, and becoming convinced that it could do really good things for development, I felt I needed some hands-on experience with it. Now, the question is: could some small guy with a moderately powerful laptop, some knowledge of R, and a slow internet connection do it?
After all, most of what I had read seemed to carry a "Don't try this at home" kind of warning. One source says that working with big data requires "massively parallel software running on tens, hundreds, or even thousands of servers".
Really large data, in the terabyte range or larger, is usually handled by the Apache Hadoop software. Wikipedia describes Hadoop as "an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware". The underlying idea is to "split" the data into manageable pieces, do your calculations on the pieces separately and at the same time ("apply"), and then "combine" the results.
I read the Wikipedia page on big data, which said: "Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization". My principal interest is analysis, and perhaps visualization as a part of analysis. So I looked for examples of analyses of data too big to handle on one laptop using open source statistical software like R.
At the same time I was aware that, although R is an excellent statistical environment, its limitation is that it needs to hold all the data it is meant to process entirely in the computer's memory. According to Kane et al. (Scalable Strategies for Computing with Massive Data, Journal of Statistical Software, November 2013, Volume 55, Issue 14), for use with R a data set should be considered large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.
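A rough way to check where a given data set falls on that scale is to compare its in-memory size with the RAM available to R. The snippet below is only a sketch; 'dat' is a placeholder object, and memory.limit() works on Windows (which is what my laptop runs):

```r
# Rough check of whether an object is "large" (> 20% of RAM) or
# "massive" (> 50% of RAM) in the Kane et al. sense.
# 'dat' is a placeholder; replace it with your own data frame.
dat <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

ram_mb    <- memory.limit()                        # RAM available to R in MB (Windows only)
object_mb <- as.numeric(object.size(dat)) / 1024^2 # size of the object in MB

share <- object_mb / ram_mb
if (share > 0.5) {
  message("massive: ", round(100 * share, 1), "% of RAM")
} else if (share > 0.2) {
  message("large: ", round(100 * share, 1), "% of RAM")
} else {
  message("manageable: ", round(100 * share, 1), "% of RAM")
}
```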
Then I found the article "bigglm on your big data set in open source R, it just works – similar
as in SAS" at http://kadimbilgi.blogspot.com/2012/11/bigglm-on-your-big-data-set-in-open.html.
The author (Bilgi) said
"In a recent post by Revolution Analytics (link & link) in which Revolution was
benchmarking their closed source generalized linear model approach with SAS,
Hadoop and open source R, they seemed to be pointing out that there is no
'easy' R open source solution which exists for building a poisson
regression model on large datasets.
This post is about showing that fitting a generalized linear model to large data is easy in open source R and just works".
As you may know, Revolution Analytics is the company that sells the commercial version of R. Inspired by Bilgi, I set out to learn the R package "ff" and at the same time tried to get hold of data large enough to experiment with. Shortly after discovering this article, I was lucky to be visiting Singapore and so was able to download large data files. I was thinking of getting data files of about 1 terabyte, so I bought a 4 TB hard disk from Amazon.
I was then able to download a number of large data sets, but none came close to 1 TB. The largest was the American Statistical Association's 2009 Data Expo data set, covering the flight arrival and departure details of all commercial flights within the USA from October 1987 to April 2008: about 120 million records on 29 variables. The compressed file was about 1.7 GB and expanded to about 12 GB. The data set I actually used for exploring the analysis of large data sets was the 5 percent sample of the US population census, available from IPUMS-USA at https://usa.ipums.org/usa/. It contains about 5.7 million household records and about 14 million person records. I am somewhat familiar with household surveys, and that was the main reason for choosing this data set. The other reason was that, with our own census just a few months away, we could learn how to analyze census data, so that if we could get similar data from our own census later we would be ready to do our own research.
"The Integrated Public Use Microdata Series
(IPUMS) consists of over sixty high-precision samples of the American
population drawn from fifteen federal censuses, from the American Community
Surveys of 2000-2012, and from the Puerto Rican Community Surveys of 2005-2012.
Some of these samples have existed for years, and others were created
specifically for this database".
Unfortunately, IPUMS-International has no census data on Myanmar, though it includes countries from Africa, Asia, Europe, and Latin America from 1960 onward. The database currently includes 159 samples from 55 countries around the world.
Having downloaded the US census 5 percent data, I had to figure out how to go about analyzing it with R, my chosen software. My laptop has 8 GB of RAM and an Intel i5 processor, and runs Windows 7. As noted above, R is comfortable only when the data stays within about 20% of available RAM. That means I should really be working with no more than about 1.6 GB of data in memory, while the data set is about 12 GB in size.
Technically speaking, R's problem with handling big data sets involves two aspects: memory limitations and addressing limitations. These can be dealt with through a technique called memory mapping. In the CRAN Task View 'High-Performance and Parallel Computing with R' (see my post 'An Unclaimed CD on Psychometrics with R or Intro to Anything with R'), under the topic 'Large memory and out-of-memory data', you can find a short description of the R package 'ff', which makes data stored on disk behave as if it were in RAM, and of the ffbase package, which adds basic statistical functionality to ff. A good, and rather technical, account of the ff package is given in the presentation at http://user2007.org/program/presentations/adler.pdf.
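To give a flavor of how ff is used, here is a minimal sketch of loading a large CSV file into an on-disk ffdf object. The file name and the chunk size are placeholders, not my actual IPUMS extract:

```r
# Minimal sketch: load a large CSV into an on-disk ffdf object in chunks.
# "census_5pct.csv" is a placeholder file name.
library(ff)
library(ffbase)

census <- read.csv.ffdf(
  file      = "census_5pct.csv",
  header    = TRUE,
  next.rows = 500000,   # read the file in chunks of 500,000 rows
  VERBOSE   = TRUE
)

dim(census)    # rows and columns, without pulling the data into RAM
class(census)  # "ffdf": the data lives in files on disk
```

Once the data is in ffdf form, most manipulations (subsetting, recoding, merging) can be done with ffbase functions without ever loading the full table into memory.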
Following Bilgi's example in his article "bigglm on your big data set in open source R, it just works – similar as in SAS", we used the ff and ffbase packages to load and manipulate the US Census 5 percent data set, and the biglm package to fit a linear model and a generalized linear model to it successfully. Here are some of the benchmarks:
□ Importing the US Census 5 percent data set into ff format: 11.9 minutes.
□ Extracting household level information in ff format: 7.8 minutes.
□ Removing households with missing values and reformatting the data: 28.56 seconds.
□ Running a generalized linear model with the biglm package on 5,273,998 households: 40.26 seconds.
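The model call itself looked roughly like the sketch below. The object and variable names are placeholders (my actual columns came from the IPUMS extract), and the formula is my reading of the model described next, so treat this as an outline of the approach rather than the exact script:

```r
# Sketch: fit a GLM to an ffdf of households with biglm via ffbase.
# 'households' is assumed to be an ffdf; the column names are placeholders.
library(ffbase)  # supplies a bigglm() method for ffdf objects
library(biglm)

fit <- bigglm(
  hhsize ~ age + sex + ownership,  # household size on head's age, sex, and tenure
  data      = households,
  family    = poisson(link = "log"),
  chunksize = 100000               # number of rows processed per chunk
)

summary(fit)
```

Because bigglm() works through the data chunk by chunk, the full data set never has to fit in RAM at once.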
In the resulting model, age is the age of the household head (in single years), sex is the sex of the head (male/female), household size is the number of persons in the household, and ownership is the tenure of the dwelling (owned or being bought, or rented).
In his article, Bilgi first worked on about 2.8 million records and then exploded the data by a factor of 100 to create 280 million records, which he analyzed with the same procedure. I didn't follow that part of his example because it might take an hour or two to complete, but I am confident it could be done.
I also ran the same analysis on the same 5 million household records using another approach to big data: using a different kind of database management system, a column-oriented database, to feed the data required for the analysis. Standard relational databases handle data by rows and are not very good at working with big data. I used the open source MonetDB database software (not an R package), together with the MonetDB.R package, to connect to MonetDB from R. This approach has since been improved: starting with the Oct2014 release, MonetDB ships with a feature called R-Integration. I have yet to download this new version of MonetDB, learn it, and try it out.
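To illustrate the column-store approach, here is a rough sketch of querying MonetDB from R through the MonetDB.R and DBI packages. The database name, credentials, and table are placeholders, and it assumes a MonetDB server is already running locally:

```r
# Sketch: connect to a local MonetDB server from R and pull back a summary.
# The database name, user, password, and table name are placeholders.
library(DBI)
library(MonetDB.R)

con <- dbConnect(
  MonetDB.R(),
  host     = "localhost",
  dbname   = "census",
  user     = "monetdb",
  password = "monetdb"
)

# Let the column store do the heavy lifting; only the summary comes back to R.
avg_size <- dbGetQuery(
  con,
  "SELECT ownership, AVG(hhsize) AS mean_hhsize
     FROM households
    GROUP BY ownership"
)
avg_size

dbDisconnect(con)
```

The point of the column store is that aggregations like this touch only the columns they need, so they stay fast even when the table has tens of millions of rows.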
I learned to use MonetDB for processing big data with R by following Anthony Damico's examples of processing sixty-seven million physician visit records, available at http://www.asdfree.com/2013/03/column-store-r-or-how-i-learned-to-stop.html. There you will find the link to download the MonetDB software, as well as why and how to install MonetDB with R. There are also links to a good list of public-use data sets you can download, including those that appear in his code examples.
You may also want to visit the official MonetDB website at: https://www.monetdb.org/.