Saturday, January 31, 2015

Little data: facing the one-legged little wind


Big data has been called a data tsunami. It has been described as data exhaust or found data. Perhaps the key distinction between big data and little data is that with the latter you have the option of making your data represent the population you are targeting in your research, as in a sample survey, for example.

A lazy way to get some idea of how little data measure up to big data (with all the hype) is to do a Google search, I suppose.

Google Search: little data vs big data
All time
Past year
Past month

David Vs. Goliath: Why Little Data Will Win Over Big Data

Little Data vs. Big Data: Does Size Matter? | 6Sense

Market Research - Little Data vs. Big Data: Nine Types of ...

What's Holding Us Back From Big Data? Daniel Burrus ...

Small data vs big data: the battle that never was ...

Big data - Wikipedia, the free encyclopedia

You May Not Need Big Data After All - HBR

Little Data vs. Big Data: Does Size Matter? | 6Sense

The Big Buzz About Big Data | UKFast Blog

Forget big data, small data is the real revolution | News | The ...

Big Data vs. Small Data - Is there a Difference ...

Microsoft vs US.gov, Internet of Stuff, Big Data - The Channel

Little Data vs. Big Data: Does Size Matter? | 6Sense

Why Companies Need to Focus on 'Little Data' - WSJ Blogs

Our Future: Free Will vs. Predictions with Data - Lutz Finger

Is Little Data The Next Big Data? | Jonah Berger | LinkedIn

Little privacy in the age of big data - The Guardian

AllAnalytics - Matthew Brodsky - Big Chief Data, Little Chief ...

Is Little Data The Next Big Data? | Jonah Berger

Forget Big Data. Use Little Data for Incremental Self ...

6sense | LinkedIn


Big Data vs. Small Data - Is there a Difference ...

Big Data vs. CRM: How Can They Help Small Businesses?

Hype vs Reality regarding Big Data? | James McGovern ...

Big data - Wikipedia, the free encyclopedia

Big Data vs Little Data - Sales Initiative

Data Informed | Big Data and Analytics in the Enterprise


These were the first pages of search results for three different time frames. Without looking at their contents, I felt that the idea that little data could hold its ground was the dominant opinion. That impression could well be wrong, based as it is on just the titles from the first pages of results that are themselves a product of big data! So I had better stay non-committal and say "use both, suitably".

For that matter, the title "Small data vs big data: the battle that never was" of the June 2, 2014 post by Pam Baker on the FierceBigData site makes me feel like I've found a sympathizer with this view. However, she was thinking about little data as subsets of big data:

Every so often media reports come blasting the message that little data wins over big data. Give it a minute and more media reports will come out saying the opposite. So which is winning in the business arena--big or small data? Neither. This is the battle that never was. There is a time for big data and a time for little data. Further, big data is made of little data and it's ridiculous to pit the piece against the whole and declare one the all-occasion winner. Further still, one almost always drills down to little data after gaining the big data, big picture insight. Why would one step be superior to another in the same process?

To use another metaphor to make the point: when you pit small data against big data you are not comparing apples to oranges but a bushel of apples against a planet of orchards.

And all the search results, as well as her post, were talking about business applications, while we are interested in the use of big data for development.

On the other hand, it is said that the future of big data is all about predictions. Time and again we learn that the sheer size of data is no substitute for relevant data. Lutz Finger, in "Our Future: Free Will vs. Predictions with Data", contrasted an example from the big data era with a prediction from the ancient past:

But often it is not the amount of data that matters to create a good prediction. For example, the Incas predicted the best time to plant crops. Their dataset might have been as little as 3560 data points (= 10 years) – nothing in our big data world. 500 years later we have companies like Google that measure a lot about our online behavior. But despite all this data, predictions are not necessarily easy. For example, New York Times bestselling business author Carol Roth once complained in her blog that Google infers that she is a male over age 65, when in fact she is a woman decades younger.

Why is this? Because not all of the data Google has aggregated is really helpful for the specific prediction they try to make. 

Back to our theme: traditionally, data for development comes from the research community and official statistics, and comprises experimental data, observational data, survey data, and administrative records. These are the little data I'm thinking about. I may simply say that little data is the kind of data we had before big data came around, and most people may have become aware of big data only after 2011 or so.


So, before the "big buzz about big data" there was little data, long recognized as the basis for evidence-based policymaking and monitoring in all countries, and especially in developing countries. In the area of little data, the Paris21 consortium is a partnership of policymakers, analysts, and statisticians from all countries of the world, focusing on promoting high-quality statistics, making these data meaningful, and designing sound policies. It was established in November 1999 in response to the UN Economic and Social Council resolution on the goals of the UN Conference on Development. A significant current project of Paris21 is Informing a Data Revolution (IDR), funded by a grant from the Bill and Melinda Gates Foundation. Paris21 asked "Are developing countries ready for the data revolution?"

Are we ready for the data revolution? In the old days we would jokingly answer: "it's good; spicy hot, though". Now, I remember my days as a youngster fascinated by the little whirlwinds we call lay-bway. You can't guess where one is going, and it is this that makes them so fascinating. If one brushes you, all the leaves, dust and sand floating around in it sting your eyes. I remember one of our writers of the old generation, Ze-ya, imaginatively called it the one-legged little wind, which we would have expected from a writer like Dagon-taya and not from him.

But what's this data revolution anyway? In their report "A World that Counts: Mobilising the Data Revolution for Sustainable Development" of November 2014
(http://www.undatarevolution.org/wp-content/uploads/2014/11/A-World-That-Counts.pdf), the Independent Expert Advisory Group gives the rationale:

As the world embarks on an ambitious project to meet new Sustainable Development Goals  (SDGs), there is an urgent need to mobilise the data revolution for all people and the whole planet in order to monitor progress, hold governments accountable and foster sustainable development. More diverse, integrated, timely and trustworthy information can lead to better decision-making and real-time citizen feedback. (Executive summary, p. 3)

And it defines the data revolution this way:

The data revolution is:
         An explosion in the volume of data, the speed with which data are produced, the number of producers of data, the dissemination of data, and the range of things on which there is data, coming from new technologies such as mobile phones and the “internet of things”, and from other sources, such as qualitative data, citizen-generated data and perceptions data;
         A growing demand for data from all parts of society.

After all, it reads like what you see in any writing about big data these days. Maybe I could summarize it for the dummies: (i) Let there be big data, and (ii) Witness the surge in demand for data.

Then they link the data revolution with the sustainable development goals. There were three bullets, but it seems to me that the first is the essential one.

The data revolution for sustainable development is:
         The integration of these new data with traditional data to produce high-quality information that is more detailed, timely and relevant for many purposes and users, especially to foster and monitor sustainable development;

So now, (iii) Let's arrange a marriage of the little data with the most eligible big data.
I am glad that that is what I arrived at vaguely (or more plainly, through guesswork), though I am not sure that it is not a marriage of convenience. But how do you actually get the little data married to the big data (I guess they may just have been working on the match-making), and specifically for the stewardship of sustainable development?

The executive summary indicates how the data revolution for sustainable development could be used: (i) directly, by enabling us to "monitor progress", and (ii) complementarily, through "... hold(ing) governments accountable ... (and getting) real-time citizen feedback." Here the second part could also be seen as a revolution for equality between the data rich and the data poor:

... the data revolution can be a revolution for equality. More, and more open, data can help ensure that knowledge is shared, creating a world of informed and empowered citizens, capable of holding decision-makers accountable for their actions. (p. 8)

But where's this eye-stinging part? It seems that nations with a lot of catching up to do could find coping with the data revolution a bit spicy-hot. In particular, governments with creaking national data infrastructures will have to face quite formidable tasks like these:

National statistical offices, the traditional guardians of public data for the public good, will remain central to the whole of government efforts to harness the data revolution for sustainable development. To fill this role, however, they will need to change, and more quickly than in the past, and continue to adapt, abandoning expensive and cumbersome production processes, incorporating new data sources, including administrative data from other government departments, and focusing on providing data that is human and machine-readable, compatible with geospatial information systems and available quickly enough to ensure that the data cycle matches the decision cycle.

Anyway, when you open your windows and this sudden gust of lay-bway hits your face and stings your eyes, you need not panic. Think of it as ventilation a bit stronger than usual.

Things that need to be done have to be done somehow and, as usual, the UN post-2015 development agenda does not come without a package to assist: in this case, a partnership to catalyze global solidarity for sustainable development. Also, you could look for technical assistance from projects like Informing a Data Revolution (IDR) and others.

We are glad to know that Myanmar already has good relations with Paris21. It is one of eleven countries from Southeast Asia, South Asia, and North Asia that successfully completed the first National Strategy for the Development of Statistics (NSDS) Training Course in the Asian Region, organized in December 2014 by Paris21 in collaboration with the Statistical Institute for Asia and the Pacific (SIAP).

Paris21 announced on their website an opportunity for the voice of developing countries to be heard in the debate on the data revolution, which we should at least be aware of:

In the months leading up to September 2015 there will be a comprehensive process to involve as many people as possible in discussions about the data revolution, what it should do, who should be involved and how it should be put into action. It is essential that the voice of developing countries is heard in this debate and that the discussion is not hijacked by special interests or those with the deepest pockets.

Thursday, January 29, 2015

Growing old with dignity, and hope


As you grow old and are relegated to the thadu row, as my mother used to say, you will appreciate your children's and friends' provision of warmth, food, and material comfort, and acknowledge them by saying thadu-thadu-thadu. Then, how would you occupy yourself in your free time? If you gaze somewhere, grumble about the weather and noisy brats, and forget to answer, I for one would fully understand you. You would now be thinking about growing old gracefully, and now you are me. If the old could live with dignity, that would mean growing old gracefully. Now if everyone is living with dignity, no one will need to yearn for growing old with dignity.

I couldn't quite remember if that was before he left Yangon or after he visited Yangon for the first time that an older friend of mine told me with relief that our brains wouldn't degenerate that much as we grow old. That was many years ago and he was a retired academic who has been living down under for quite a while.
Afterwards, I chanced to read the transcriptions of the BBC Reith Lecture of 2001, "The End of Age" by Tom Kirkwood. For me it was much more than a shot in the arm as the opening words of the first lecture carried so much promise:

"Never in human history has a population so wilfully and deliberately defied nature as has the present generation. How have we defied it? We have survived. Our unprecedented survival has produced a revolution in longevity which is shaking the foundations of societies around the world and profoundly altering our attitudes to life and death.

At the same time, science has made hitherto undreamed-of advances in human biology. The explosive force of these two revolutions coming together lies at the heart of my series of Reith Lectures, as it has been at the heart of my work. Science has new things to tell us about the process of ageing. We know now that ageing is neither inevitable nor necessary."

His series of lectures, (i) Brave Old World, (ii) Thread of Life, (iii) Sex and Death, (iv) Making Choices, and (v) New Directions concluded with these words:

In this series of lectures, I challenge science and society to look afresh at what is happening in our world, to recognise the opportunities, as well as the threats to future stability, that stem from the revolution in longevity.

I challenge the scientific community to think not only of directing energy towards curing illnesses, but to turn increasingly towards the less glamorous but vital task of helping our ageing cells to guard against the drear damage of the daily grind. I challenge medicine to look in radically new ways at the maintenance of health and quality of life of older people. Can you imagine a world in which the first thing the doctor asks is not your date of birth?

I challenge society, collectively and individually, to rethink its attitudes to older people, to recognise the value and beauty of the fact that we are all living so much longer, and to make sacrifices to accommodate those who presume to live on when previously we would have died.

Above all, I challenge us all to put an end to age as something that we let get in the way of celebrating all individuals on this earth as true equals.

Needless to say, any tiny bit of solace and hope you put in our ears about doing away with inequality reverberates, especially when age is not in the way.

A bit earlier than the time of this lecture, the General Assembly of the United Nations incorporated most of the international development goals (IDGs) in the Millennium Declaration in September 2000. According to Hulme in "The Millennium Development Goals (MDGs): A Short History of the World’s Biggest Promise", 2009 (http://www.bwpi.manchester.ac.uk/medialibrary/publications/working_papers/bwpi-wp-10009.pdf), the IDGs were a mixed blessing, especially to issue-based NGOs:

For issue-based NGOs the response depended on the treatment of their issue. Save the Children might be pleased with Universal Primary Education and reduced child and infant mortality goals, but there was little in the IDGs for the older persons that HelpAge International assists. Environmental NGOs saw a confirmation of the Rio Declaration and a further acceptance of the arguments that development and poverty reduction had to involve environmental goals. NGOs concerned about reproductive health rights were pleased to see their main goal in the text, but women’s NGOs, and more broadly the social movement for gender equality, were livid at the watering down of the gender goal. For more radical NGOs and the emerging networks of anti-capitalist and anti-globalisation groups then the IDGs were just more of the same – capitalism trying to mask its exploitation of labour and the environment through the rhetoric of social development. But the NGO and social movement response was largely a ‘northern’ response. For NGOs in the developing world the vision of the OECD for the future of their countries and the drawing up of the IDGs barely registered. (p. 18)

Though the MDGs prevailed, not everyone agreed, and NGOs didn't concede without a fight (http://www.rorg.no/Artikler/729.html):

The UN Millennium Development Goals (MDGs) build on a set of goals first developed by the rich countries in the OECD strategy "Shaping the 21st Century" adopted in 1996. These were later the basis for a document - A Better World for All - presented in June 2000 by the OECD, the World Bank, the IMF and the UN - claiming that the document was building on the global United Nations conferences and summits of the 1990s. This was fiercely rejected by civil society, gathered in Geneva on the occasion of the World Summit for Social Development +5 Conference (UNGASS), who promptly renamed the document "Bretton Woods for All" and called on the UN to withdraw its support (see NGOs call on the UN to withdraw endorsement of "A Better World for All" document). "A Better World for All" also prompted the general secretary of the World Council of Churches, Konrad Raiser, to send a personal letter to the UN Secretary General Kofi Annan expressing the concern of NGO delegates that UN support for the document "amounted to a propaganda exercise for international finance institutions whose policies are widely held to be at the root of many of the most grave social problems facing the poor all over the world and especially those in the poor nations".

All these are a feast for the mind if we care to look for them on the Web, read them, form our own dumb opinions, and share them for fun and maybe some use.

As for the missing goals for the older persons in the MDGs, you can't complain because as Vandemoortele says in his insider's story of MDGs (http://courses.arch.vt.edu/courses/wdunaway/gia5524/vandem11.pdf):

... it is common for the representatives of the different perspectives to complain that their focus — e.g. infrastructure, governance, human rights, etc. — is not explicitly mentioned in the MDGs. Their implicit view is that the MDGs are an exhaustive list of all the things necessary for achieving human development. They usually refer to the concept of ‘MDG-plus’ meaning that their specific concerns should be added to the MDGs. This, however, would be self-defeating. If all aspects of development were to be included, the MDGs would become overloaded and incomprehensible to their primary users. (p. 8)

As for who the primary users were, he said "The fundamental purpose of the MDGs is not for each and every country to meet the global targets, which would be utopian. Their ultimate aim is to help align national priorities with the MDG agenda so as to foster human well-being. Therefore, the intended users are primarily politicians, parliamentarians, preachers, teachers and journalists." If so, you may wonder where the people stand. Are they to be fed only through politicians, parliamentarians, preachers, teachers and journalists?

In our society, there's not much to discuss or think over about caring for our own children (as if granting the loan) and our parents (as if paying our debt), and also about helping other relatives within our capacity. The parrot king in Sālikedāra-Jātaka said this for us all (translated by W.H.D. Rouse, 1901):

"My callow chicks, my tender brood, whose wings are still ungrown,
Who shall support me by and bye: to them I grant the loan.

"Then my old ancient parents, who far from youth's bounds are set,
With that within my beak I bring, to them I pay my debt.

"And other birds of helpless wing, and weak full many more,
To these I give in charity: this sages call my store.

"This is that loan the which I grant, this is the debt I pay,
And this the treasure I store up: now I have said my say."

That means that so long as we keep values like these, and our breadwinners are healthy and have equal opportunities, no elderly person, or anyone else, needs to worry, at least about meeting our most basic needs. We have seen that the ability of our workforce to "pay their debt and grant their loans" depends crucially on effectively meeting the sub-goal "Achieve Decent Employment for Women, Men, and Young People" of Goal 1 of the Millennium Development Goals: Eradicate Extreme Poverty and Hunger. For that, and for other entitlements and obligations to our people, we hoped that our national aspirations, as formulated and implemented by our leaders, would largely have been in line with the MDGs, and that the achievements accumulated by the end of 2015 would be decent or better.

Reporting to the sixty-eighth session of the UN General Assembly in July 2013, the Secretary General, in his report "A life of dignity for all: accelerating progress towards the Millennium Development Goals and advancing the United Nations development agenda beyond 2015", acknowledged that more than a billion people still live in extreme poverty and that far too many people face serious deprivation in health and education, with progress hampered by significant inequality related to income, gender, ethnicity, disability, age and location. Yet he was optimistic. He said "Ours is the first generation with the resources and know-how to end extreme poverty and put our planet on a sustainable course before it is too late" and pointed out that the transition to sustainable development, as underscored in the outcome document of the United Nations Conference on Sustainable Development held in Rio de Janeiro, Brazil, in 2012, is the key to the post-2015 development agenda.

I noticed that this report mentioned something about the elderly: under paragraph 92, "Address demographic challenges", we find "... Countries with an ageing population need policy responses to support the elderly so as to remove barriers to their full participation in society while protecting their rights and dignity". I was glad, particularly because the Secretary General seemed to count on us through these words: their full participation in society. But I wonder whether the qualifier "countries with an ageing population" might give countries with relatively low life expectancy, like ours, an excuse to ignore the elderly.

Equality of opportunity is the key ingredient for leaving no one behind (presumably including the elderly), and paragraph 84, "Tackle exclusion and inequality", says: "In order to leave no one behind and bring everyone forward, actions are needed to promote equality of opportunity. This implies inclusive economies in which men and women have access to decent employment, legal identification, financial services, infrastructure and social protection, as well as societies where all people can contribute and participate in national and local governance".

With the MDGs' deadline at the end of 2015 in sight, the feverish preparations for the post-2015 agenda have been crowned with the UN Secretary General's long-awaited synthesis report, The Road to Dignity by 2030: Ending Poverty, Transforming All Lives and Protecting the Planet, which came out on December 4, 2014. The road to dignity relies on six essential elements for delivering the sustainable development goals: dignity, people, prosperity, planet, justice, and partnership.


In their letter of 30 May 2013 to the UN Secretary General, the Co-Chairs of the High-Level Panel of Eminent Persons on the Post-2015 Development Agenda stated: "We transmit our recommendations to you with a feeling of great optimism that a transformation to end poverty through sustainable development is possible within our generation". To common people from a developing country, like you and me, these respectable leaders sound sincere and convincing. If they feel comfortable with the post-2015 development agenda, why shouldn't we? Granted that their visionary judgment is right, and that all the sources from which the Secretary General drew to synthesize his Road to Dignity by 2030 deliberations were reliable, we won't risk too much in hoping that poverty could indeed be eradicated from this world "within our generation". To simplify matters I would prefer to think of this time frame as "in my lifetime".

But that's a tall order. The catch is that only each nation itself can make that happen, through political will and by mobilizing a certain level of its own capacity. The latter requisite could be augmented through global solidarity and partnership, though.

The SDGs deliberated in the Road to Dignity by 2030 report still await final agreement by the states, and almost immediately there were interesting reactions to the report from the global community. An issue that interested me much is the definition of "dignity". I managed to find only one instance of someone questioning the means for achieving dignity. The author doubted the report's suggestion that dignity could be attained by ending poverty and fighting inequality, and asked "... will they result in dignity for all? Or will aiming for dignity for all help end poverty and inequality?". He also believed that the poor could possess dignity. For me, ending poverty and fighting inequality are essential for bestowing dignity, but the "Justice" element will be needed to consolidate it. I would also like to see hopes kept alive, to complete my vision of dignity.


The report recognizes the need to “remove obstacles to full participation by persons with disabilities, older persons, adolescents and youth, and empower the poor” (Para 68) but still treats people more as recipients of development than active agents and drivers of change.

On sustainability and economic growth, we are concerned that the Secretary-General appears to underscore the need to retain an approach based on economic growth as the solution to our global challenges, rather than recognizing that it has created or contributed to many of those challenges. We recognize though that the report does mention the need for the economy to serve people and planet. The real transformation of our economies (Para 54) will only be achieved if we take a path that addresses inequity and environmental and social costs of business-as-usual, measures progress “beyond GDP” and complies with human rights obligations.

Goals, Targets and Indicators
... Progress must be measured in ways that “go beyond GDP and account for human well-being, sustainability and equity” (Para 72). Availability and access to data, including disaggregated information (Para 46) are key concerns, and the report still misses the active participation of people. Instead of referring to a world where everyone counts, the vision is a world where everyone ‘is counted’ (Para 31).

A Participatory accountability, monitoring and review mechanism
... The section on monitoring, evaluation and reporting falls short on citizens' (including children and youth) participation and presents citizens largely as beneficiaries than active actors in the implementation and accountability.


Most of us are aware of the year 2015 as the landmark year in which our national elections are to be held, as a step towards advancing further in our quest for a fully democratic society. Quite a few will know that the global development agenda based on the MDGs will expire at this year's end and that the Post-2015 development agenda based on the Sustainable Development Goals will take over from then on.


Obviously, neither should distract from the other. They could be mutually reinforcing and harmonious. The political awakening we are gaining from the election process may be directed to reinforce the heightened awareness and resolution for national progress in working on the still unfinished business of the MDGs and the UN Post-2015 development agenda, and the other way round.

Thursday, January 8, 2015

Correlates of labor productivity growth


Since the new correlation measures are good at detecting nonlinear associations as well as linear ones, they would be ideal for exploring complex dependencies between pairs of random variables. In big data, dependencies between thousands of such pairs would be computed and ranked in order of strength, and those with high enough dependencies would be investigated further.

These are the cases where "... the pairwise relationship between many variables is simultaneously explored. In statistics, this exploration is formalized in a multiple hypothesis testing framework, where the null hypothesis of statistical independence is examined for every pair of variables. Then, the p-values of the tests serve as a basis for generating final conclusions. Specifically, the pairs of variables are ordered by their p-values (or the adjusted p-values after correcting for multiple testing) in increasing order, and the pairs with the lowest p-values will be further studied.  Reshef et al. recommended ranking the pairs based on MIC, which in this case is equivalent to ranking based on the p-values of the MIC tests, for fixed sample size. " (Comment on "Detecting Novel Associations in Large Data Sets", Gorfine et al, 2012, http://iew3.technion.ac.il/~gorfinm/files/science6.pdf).
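The ranking procedure described in the quotation can be sketched roughly as follows. This is only an illustration under stated assumptions: the absolute Pearson correlation stands in for the dependence score (MIC, dCor, or HHG would be plugged in instead), the p-values come from a simple permutation test, no multiple-testing adjustment is applied, and the column names and the helper names `permutation_p_value` and `rank_pairs` are made up for the example.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, score, n_perm=200, seed=0):
    """Estimate a p-value for independence by shuffling y and re-scoring."""
    rng = random.Random(seed)
    observed = score(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if score(x, y_perm) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of 0.
    return (hits + 1) / (n_perm + 1)


def rank_pairs(columns, score=lambda x, y: abs(pearson(x, y))):
    """Return (p_value, name_a, name_b) for every pair, smallest p first."""
    results = []
    for a, b in combinations(sorted(columns), 2):
        p = permutation_p_value(columns[a], columns[b], score)
        results.append((p, a, b))
    return sorted(results)


# Hypothetical toy data: "b" depends exactly on "a", "c" is unrelated noise.
noise = random.Random(1)
demo = {
    "a": list(range(20)),
    "b": [2 * v + 1 for v in range(20)],
    "c": [noise.random() for _ in range(20)],
}
ranked = rank_pairs(demo)
for p, a, b in ranked:
    print(f"{a}-{b}: p = {p:.3f}")
```

The pairs at the top of the list are the ones that would be studied further; with thousands of variables, an adjustment for multiple testing would be needed before reading much into the smallest p-values.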

The following figure from "A comparative study of statistical methods used to identify dependencies between gene expression signals", Santos et al., 2013
(http://www.princeton.edu/~dtakahas/publications/Brief%20Bioinform-2013-de%20Siqueira%20Santos) summarizes the type of dependencies and suitable methodologies to capture them.


In the previous post we explored the new dependency measures MIC, dCor, and HHG on the baseball dataset used by Reshef et al. in their original 2011 paper on MIC. Following their example of analyzing the WHO datasets, here we try analyzing the World Bank Enterprise Surveys indicators freely available on their data portal. We downloaded all indicators under the 13 available topics for all available countries of (i) East Asia & Pacific, and (ii) Sub-Saharan Africa. The data covered 77 country/year observations (including Lao, Cambodia, Malaysia, Myanmar, Philippines, Thailand, and Vietnam in ASEAN) and 125 indicators. After excluding some countries and some indicators to maximize the number of indicators available, 67 country/year observations and 78 indicators remained.

To explore the dependencies between the other 77 indicators and the "Annual labor productivity growth (%)" of manufacturing firms, we ran a dependency analysis with the Pearson, MIC, dCor, and HHG methods for each pair of indicators. The results:
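As a rough illustration of this kind of one-against-all analysis, the sketch below scores every indicator against a target column using the sample distance correlation (dCor) of Székely and Rizzo, written out in plain Python. It is not the code used for the results in this post: the indicator names and values are invented, missing values are assumed to be coded as `None` and dropped pairwise, and MIC and HHG are left out because they need specialised packages.

```python
import math


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_means = [sum(row) / n for row in d]  # equal to column means (symmetric)
    grand = sum(row_means) / n
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation; 0.0 if either variable is constant."""
    n = len(x)
    A, B = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(a * a for row in A for a in row) / n**2
    dvar_y = sum(b * b for row in B for b in row) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / denom)


def rank_against_target(table, target):
    """Score each other column against the target, strongest first."""
    scores = []
    for name, values in table.items():
        if name == target:
            continue
        # Pairwise deletion: keep rows where both values are present.
        pairs = [(a, b) for a, b in zip(table[target], values)
                 if a is not None and b is not None]
        if len(pairs) < 2:
            continue
        xs, ys = zip(*pairs)
        scores.append((distance_correlation(xs, ys), name))
    return sorted(scores, reverse=True)


# Hypothetical indicators: one is nonlinearly related to the target,
# one is constant and therefore carries no information.
demo = {
    "labor_productivity_growth": [x * x for x in range(-5, 6)],
    "indicator_nonlinear": [None] + list(range(-4, 6)),
    "indicator_constant": [1.0] * 11,
}
ranked = rank_against_target(demo, "labor_productivity_growth")
for score, name in ranked:
    print(f"{name}: dCor = {score:.3f}")
```

The double loop over the distance matrices costs on the order of n squared operations per pair, which is fine for a few dozen country/year rows but would need a faster implementation for large samples.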





Not being an economist, I have nothing much to say about interpreting the results or making sense of them. To our people, I just meant to draw attention to the wonderful world of free/open-source software, to data in the public domain, and to some new tools that have been created for analyzing them. I am sure the more energetic ones will acquire the micro data for Myanmar from the World Bank data portal or other places, and go on to analyze them, to add to our skills and to gain insights that contribute to the pool of knowledge we badly need.


Friday, January 2, 2015

Big data: hands-on correlation, old and new



I first heard about the maximal information coefficient (MIC) last year, when I came across the article “'Detecting Novel Associations in Large Data Sets' — let the giants battle it out!” [http://scientificbsides.wordpress.com/2012/01/23/detecting-novel-associations-in-large-data-sets-let-the-giants-battle-it-out/]
There the author of the blog draws attention to the enthusiastic acceptance of MIC by Terry Speed as “a correlation for the 21st century” and, in contrast, to the comment by Noah Simon and Rob Tibshirani, which pointed out MIC's shortcomings such as its "serious power deficiencies, and hence when it is used for large-scale exploratory analysis it will produce too many false positives". The latter recommended the distance correlation measure (dCor) of Székely & Rizzo (2009) for general use.

Two weeks ago I was curious about the results of this battle and looked again. That is when I came to know about the HHG measure, in addition to MIC and dCor, as well as some others.

The well-known traditional measure of dependence between two random variables is the Pearson correlation coefficient, which is still widely used today. Its features are well characterized by the following illustration (Correlation, Wikipedia).
As the last row of the figure shows, the Pearson correlation coefficient cannot capture the nonlinear relationships present in any of those panels (except the last one on the right, the four independent clouds, which has no relationship between X and Y).

Moreover, the correlation coefficient is a summary statistic and so cannot replace the individual examination of the data, as illustrated below, where each of the individual plots has the same correlation coefficient of about 0.8 (0.816 to three decimals)! (Correlation, Wikipedia).
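For readers who want to check this themselves, here is a small Python sketch. I am assuming that the Wikipedia illustration is Anscombe's quartet, and the numbers below are the standard published values of that dataset. The Pearson function is written out by hand so that nothing beyond the standard library is needed.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Anscombe's quartet (assumed to be the dataset behind the illustration).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Four very different scatterplots, one nearly identical summary number.
correlations = {name: pearson(x, y) for name, (x, y) in quartet.items()}
for name, r in correlations.items():
    print(name, round(r, 3))
```
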



By visually inspecting individual scatterplots you may be able to reduce the false-positive and false-negative rates that arise from the inadequacies of the Pearson correlation measure. However, viewing the scatterplots of all potential pairs of variables of interest is no longer possible with big data, where thousands of variables are measured simultaneously. In the yeast expression data analyzed in the paper by Wang et al., with 6,000 genes, there are around 18,000,000 gene pairs, and it is a daunting task to sort through that many pairs to identify those with genuine dependencies (Putting things in order, Sun and Zhao, PNAS, November 18, 2014).
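The "around 18,000,000" figure is simply the number of unordered pairs that can be formed from 6,000 genes, 6000 × 5999 / 2. A quick check in Python (the variable names are mine):

```python
from math import comb

n_genes = 6000
# Number of distinct unordered gene pairs: n * (n - 1) / 2.
n_pairs = comb(n_genes, 2)
print(n_pairs)  # 17997000
```
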

The recent trend is toward developing methods that capture complex dependencies between pairs of random variables. This is because in many modern applications the dependencies of interest may not take simple forms, and the classical methods therefore cannot capture them. For example, the distance correlation coefficient (dCor) can capture nonlinear relationships, as shown below (Distance Correlation, Wikipedia).

The Maximal Information Coefficient (MIC) is based on concepts from information theory. Mutual information measures how much information one variable reveals about another; it applies to variables of any type and does not depend on the functional form underlying the relationship. The MIC (Reshef et al., 2011) can be seen as the continuous variable counterpart to mutual information.
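To make the idea of mutual information concrete, here is a rough Python sketch that bins two continuous variables on a fixed equal-width grid and computes the plug-in mutual information in bits. It is not MIC itself: as I understand it, MIC searches over many grids and normalizes the result, none of which is attempted here. The function names and the choice of four bins are my own assumptions.

```python
from collections import Counter
from math import log2


def to_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant variable
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=4):
    """Plug-in estimate of mutual information (bits) on a fixed grid."""
    bx, by = to_bins(x, n_bins), to_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    # Sum of p(i,j) * log2( p(i,j) / (p(i) * p(j)) ) over occupied cells.
    return sum((c / n) * log2(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


# A parabola: no linear trend at all, yet y is fully determined by x.
x = [i / 50 - 1 for i in range(101)]
y = [v * v for v in x]
mi_parabola = mutual_information(x, y)
print(round(mi_parabola, 3))
```

The value depends on the grid chosen, which is one reason a fixed-grid estimate like this is only a teaching device.
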

Distance correlation (dCor) is a measure of association (Székely et al., 2007; Székely and Rizzo, 2009 at https://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453933) that uses the distances between observations as part of its calculation.
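The role of "distances between observations" can be seen in a short Python sketch of the sample distance correlation for two one-dimensional variables, following the usual recipe of double-centering the pairwise distance matrices. It is a pure-Python illustration under my own naming, not the dcor( ) function of the energy package, and it is only practical for small samples.

```python
from math import sqrt


def centered_distances(values):
    """Pairwise |a - b| distances, double-centered (row, column, grand mean)."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length numeric lists."""
    n = len(x)
    a, b = centered_distances(x), centered_distances(y)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)  # squared distance covariance
    denom = sqrt(mean_product(a, a) * mean_product(b, b))
    return sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


x = [i / 10 - 1 for i in range(21)]
dcor_linear = distance_correlation(x, [2 * v + 1 for v in x])
dcor_parabola = distance_correlation(x, [v * v for v in x])
print(round(dcor_linear, 3), round(dcor_parabola, 3))
```

For the straight line the measure comes out as 1, and for the parabola, where the Pearson coefficient is essentially zero, it stays clearly above zero.
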

Heller-Heller-Gorfine (HHG) tests are a set of statistical tests of independence between two random vectors of arbitrary dimensions, given a finite sample (A consistent multivariate test of association based on ranks of distances, Heller et al., Biometrika, 2013). The arXiv version is available at: http://arxiv.org/pdf/1201.3522v3.pdf.

The concepts behind the old (classical) correlation measures, as well as the new ones, may not be out of reach of the small guys. As Michael A. Newton said about distance correlation, for example (https://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453932):

Distance covariance not only provides a bona fide dependence measure, but it does so with a simplicity to satisfy Don Geman’s elevator test (i.e., a method must be sufficiently simple that it can be explained to a colleague in the time it takes to go between floors on an elevator!).

The theories behind these new correlation measures are far too deep for me. However, you should not be discouraged from trying them out, and as usual you can find appropriate R packages that implement them. For MIC you can use the function mine() in the minerva package; for distance correlation you can use the function dcor() in the energy package; for HHG you can use the function hhg.test() in the HHG package.

I looked for and found my model for running these tests in the post "Maximal Information Coefficient (Part II)" of Wednesday, September 17, 2014 in the "me nugget" blog at: http://menugget.blogspot.com/2014/09/maximal-information-coefficient-part-ii.html#more.

The code provided in that blog implemented the MIC and Pearson correlations for the baseball data set used in Reshef et al.'s original 2011 article on MIC. There, 130 variables were correlated against a baseball player's salary from the MLB2008.csv data set available at http://www.exploredata.net/Downloads/Baseball-Data-Set.

I extended the analysis to run the dCor and HHG tests as well. The results are:
(i) Comparison with top 10 ranking MIC coefficients

              MIC MIC_Rank   Pearson Pearson_Rank      dCor dCor_Rank      HHG
D_RPMLV 0.3688595        1 0.3569901           14 0.3353516        18 588103.5
H       0.3665573        2 0.3162080           37 0.3070682        39 564774.9
TB      0.3613143        3 0.3482234           20 0.3376913        16 698983.3
PA      0.3599480        4 0.3239600           31 0.3227682        25 656445.1
BALLS   0.3559231        5 0.3686719            8 0.3595044         4 732985.0
LD      0.3540088        6 0.3078039           40 0.3076449        36 539350.6
PA.     0.3498458        7 0.3231203           32 0.3219160        27 644955.3
TOB     0.3485658        8 0.3681359            9 0.3530406         6 729775.3
FB      0.3462294        9 0.2848727           52 0.3117478        34 584117.7
STRIKES 0.3450615       10 0.3059649           41 0.3096172        35 599453.5

        HHG_Rank HHG_perm.pval.hhg.sc
D_RPMLV       30         0.0009950249
H             36         0.0009950249
TB            10         0.0009950249
PA            19         0.0009950249
BALLS          7         0.0009950249
LD            44         0.0009950249
PA.           21         0.0009950249
TOB            8         0.0009950249
FB            32         0.0009950249
STRIKES       27         0.0009950249


(ii) Comparison with top 10 ranking dCor coefficients

            MIC MIC_Rank   Pearson Pearson_Rank      dCor dCor_Rank      HHG
BB    0.3443067       11 0.4042573            1 0.3911814         1 790049.7
IBB   0.2805604       82 0.4033611            2 0.3785360         2 812153.9
UBB   0.3425437       13 0.3706693            6 0.3674838         3 707229.5
BALLS 0.3559231        5 0.3686719            8 0.3595044         4 732985.0
D_RAR 0.3335061       23 0.3771942            4 0.3532652         5 654073.9
TOB   0.3485658        8 0.3681359            9 0.3530406         6 729775.3
RBI   0.3142433       45 0.3824583            3 0.3529999         7 639921.9
D_EqR 0.3376202       20 0.3679313           10 0.3470629         8 696841.2
DP    0.3206635       38 0.3613956           12 0.3461126         9 603111.2
R1_BI 0.3025373       59 0.3741631            5 0.3431456        10 562235.0

      HHG_Rank HHG_perm.pval.hhg.sc
BB           4         0.0009950249
IBB          3         0.0009950249
UBB          9         0.0009950249
BALLS        7         0.0009950249
D_RAR       20         0.0009950249
TOB          8         0.0009950249
RBI         23         0.0009950249
D_EqR       11         0.0009950249
DP          26         0.0009950249
R1_BI       39         0.0009950249

(iii) Comparison with top 10 ranking HHG coefficients

            MIC MIC_Rank    Pearson Pearson_Rank      dCor dCor_Rank      HHG
PA_DH 0.2839103       80  0.2653424           61 0.2927642        50 933967.9
G_PR  0.2941879       70 -0.2102295           72 0.2639014        63 882670.1
IBB   0.2805604       82  0.4033611            2 0.3785360         2 812153.9
BB    0.3443067       11  0.4042573            1 0.3911814         1 790049.7
G_DH  0.2839103       79  0.2482047           63 0.2705099        62 774766.0
SHR   0.3012459       62 -0.2777007           57 0.3128214        33 758804.8
BALLS 0.3559231        5  0.3686719            8 0.3595044         4 732985.0
TOB   0.3485658        8  0.3681359            9 0.3530406         6 729775.3
UBB   0.3425437       13  0.3706693            6 0.3674838         3 707229.5
TB    0.3613143        3  0.3482234           20 0.3376913        16 698983.3

      HHG_Rank HHG_perm.pval.hhg.sc
PA_DH        1         0.0009950249
G_PR         2         0.0009950249
IBB          3         0.0009950249
BB           4         0.0009950249
G_DH         5         0.0009950249
SHR          6         0.0009950249
BALLS        7         0.0009950249
TOB          8         0.0009950249
UBB          9         0.0009950249
TB          10         0.0009950249
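The rank columns in these tables are just the positions of each variable when the scores of one measure are sorted from largest to smallest. A small Python sketch, using five MIC and dCor values copied from table (ii); note that the ranks it prints are ranks among these five variables only, not the ranks among all 130 variables shown in the tables, and that ties are not handled.

```python
def rank_descending(scores):
    """Map each name to its 1-based position when sorted by score, largest first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


# Values copied from table (ii); ranks below are within this subset only.
mic = {"BB": 0.3443067, "IBB": 0.2805604, "UBB": 0.3425437,
       "BALLS": 0.3559231, "D_RAR": 0.3335061}
dcor = {"BB": 0.3911814, "IBB": 0.3785360, "UBB": 0.3674838,
        "BALLS": 0.3595044, "D_RAR": 0.3532652}

mic_rank = rank_descending(mic)
dcor_rank = rank_descending(dcor)

# List the variables in dCor order and show where MIC puts each of them.
for name in sorted(dcor, key=dcor_rank.get):
    print(name, "dCor rank:", dcor_rank[name], "MIC rank:", mic_rank[name])
```

Even in this tiny subset the two measures disagree: IBB is second by dCor but last by MIC.
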




To get a feel for how the new correlation analyses work with nonlinear associations, I created and used as test data the following nonlinear relationships, which were described by M. A. Newton. Each can be generated by the function hhg.example.datagen() in the HHG package of the R statistical environment.