Since the new correlation measures are good at detecting
nonlinear associations as well as linear ones, they are well suited to exploring
complex dependencies between pairs of random variables. In big-data settings,
dependencies between thousands of such pairs are computed and ranked by
strength, and the pairs with sufficiently strong dependence are investigated
further.
These are the cases where "... the pairwise relationship between many variables
is simultaneously explored. In statistics, this exploration is formalized in a
multiple hypothesis testing framework, where the null hypothesis of statistical
independence is examined for every pair of variables. Then, the p-values of the
tests serve as a basis for generating final conclusions. Specifically, the
pairs of variables are ordered by their p-values (or the adjusted p-values
after correcting for multiple testing) in increasing order, and the pairs with
the lowest p-values will be further studied.
Reshef et al. recommended
ranking the pairs based on MIC, which in this case is equivalent to ranking
based on the p-values of the MIC tests, for fixed sample size." (Comment on "Detecting Novel Associations
in Large Data Sets", Gorfine et al., 2012, http://iew3.technion.ac.il/~gorfinm/files/science6.pdf).
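The ordering by adjusted p-values described in the quote can be sketched in a few lines of Python. This is only an illustration, not the procedure used by Gorfine et al. or Reshef et al.: the Benjamini-Hochberg adjustment is one common choice we assume here, and the pair names and p-values are made up for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical pairs and raw p-values, for illustration only.
pairs = ["pair_A", "pair_B", "pair_C", "pair_D"]
raw_p = [0.01, 0.04, 0.03, 0.20]

adj_p = benjamini_hochberg(raw_p)
# Pairs with the lowest adjusted p-values come first.
ranked = sorted(zip(pairs, adj_p), key=lambda item: item[1])
for name, p in ranked:
    print(name, round(p, 4))
```

The pairs at the top of such a list are the ones that would be studied further.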
The following figure from "A comparative study of statistical methods used to identify
dependencies between gene expression signals", Santos et al., 2013
(http://www.princeton.edu/~dtakahas/publications/Brief%20Bioinform-2013-de%20Siqueira%20Santos)
summarizes the types of dependencies and the methods suitable for capturing them.
In the previous post we explored the new dependency
measures MIC, dCor, and HHG on the baseball dataset used by Reshef et al. in
their original 2011 paper on MIC. Following their example of analyzing the WHO
datasets, here we try analyzing the World Bank Enterprise Surveys indicators,
freely available on their data portal. We downloaded all indicators under 13
available topics for all available countries of (i) East Asia & Pacific,
and (ii) Sub-Saharan Africa. The data covered 77 country/year combinations (including Lao,
Cambodia, Malaysia, Myanmar, Philippines, Thailand, and Vietnam in ASEAN)
and 125 indicators. After excluding some countries and some indicators to
maximize the number of indicators available, 67 country/year combinations and 78
indicators remained.
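We did not record the exact exclusion rule here, so the following is only a rough sketch of one way such a trimming could be done: drop indicators that are missing too often, then keep only the country/year rows that are complete on what remains. The function name, the threshold, and the toy table are all hypothetical.

```python
def complete_case_subset(table, max_missing_share=0.25):
    """table maps a country/year key to {indicator: value or None}.

    Returns (kept_indicators, kept_rows) where every kept row has a
    value for every kept indicator.
    """
    indicators = sorted({ind for row in table.values() for ind in row})
    n_rows = len(table)
    # Step 1: drop indicators missing in too large a share of rows.
    kept_indicators = [
        ind for ind in indicators
        if sum(row.get(ind) is None for row in table.values()) / n_rows
        <= max_missing_share
    ]
    # Step 2: keep only rows that are complete on the kept indicators.
    kept_rows = {
        key: {ind: row[ind] for ind in kept_indicators}
        for key, row in table.items()
        if all(row.get(ind) is not None for ind in kept_indicators)
    }
    return kept_indicators, kept_rows


# Toy table with made-up keys and values, for illustration only.
toy = {
    "C1_2009": {"ind_a": 1.0, "ind_b": 2.0, "ind_c": None},
    "C2_2009": {"ind_a": 1.5, "ind_b": None, "ind_c": None},
    "C3_2012": {"ind_a": 0.7, "ind_b": 2.4, "ind_c": 5.0},
    "C4_2012": {"ind_a": 0.9, "ind_b": 1.8, "ind_c": None},
}
kept_indicators, kept_rows = complete_case_subset(toy)
print(kept_indicators, sorted(kept_rows))
```

Raising or lowering the threshold trades indicators against countries, which is the balance described above.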
To explore the dependence of the other 77 indicators on the "Annual labor productivity growth (%)"
of manufacturing firms, we ran a dependency analysis with the Pearson, MIC, dCor, and
HHG methods for each of the 77 resulting pairs of indicators. The results:
Not being an economist, I have little to say about
interpreting the results or making sense of them. To our people, I just
meant to draw attention to the wonderful world of free/open-source software,
to data in the public domain, and to some of the new tools that have been created for
analyzing them. I am sure the more energetic ones will acquire the micro data
for Myanmar from the World Bank data portal or other places, and go on to
analyze it, adding to our skills and gaining insights to contribute to the pool of
knowledge we badly need.
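For anyone who wants to try, the one-against-all ranking described earlier (one target indicator paired with each of the others) can be sketched in plain Python. This is not the code behind the results in this post. It covers only Pearson and dCor (the usual sample distance correlation, written out by hand), leaves out MIC and HHG, and uses toy columns with hypothetical names.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_mean = [sum(r) / n for r in d]  # equals column means (symmetry)
    grand = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation (dCor) for one-dimensional data."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for r in a for v in r) / n**2
    dvar_y = sum(v * v for r in b for v in r) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


def rank_against_target(columns, target):
    """Return (name, pearson, dcor) rows, strongest dCor first."""
    y = columns[target]
    rows = [(name, pearson(x, y), distance_correlation(x, y))
            for name, x in columns.items() if name != target]
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Toy columns with hypothetical names: one linear and one purely
# nonlinear (quadratic) relation to the target.
target_values = [-3, -2, -1, 0, 1, 2, 3]
columns = {
    "target": target_values,
    "linear": [2 * v + 1 for v in target_values],
    "quadratic": [v * v for v in target_values],
}
ranking = rank_against_target(columns, "target")
for name, r, dc in ranking:
    print(name, round(r, 3), round(dc, 3))
```

In the toy data the quadratic column has zero Pearson correlation with the target but a clearly positive dCor, which is the kind of nonlinear dependence the newer measures are meant to pick up.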
In table (i), the description for variable "X5_firm_9" was incorrect. It has now been corrected. Apologies.