Thursday, January 8, 2015

Correlates of labor productivity growth


Because the new correlation measures detect nonlinear as well as linear associations, they are well suited to exploring complex dependencies between pairs of random variables. In big-data settings, dependencies would be computed for thousands of such pairs and ranked by strength, and the pairs with sufficiently strong dependencies would be investigated further.

These are the cases where "... the pairwise relationship between many variables is simultaneously explored. In statistics, this exploration is formalized in a multiple hypothesis testing framework, where the null hypothesis of statistical independence is examined for every pair of variables. Then, the p-values of the tests serve as a basis for generating final conclusions. Specifically, the pairs of variables are ordered by their p-values (or the adjusted p-values after correcting for multiple testing) in increasing order, and the pairs with the lowest p-values will be further studied. Reshef et al. recommended ranking the pairs based on MIC, which in this case is equivalent to ranking based on the p-values of the MIC tests, for fixed sample size." (Comment on "Detecting Novel Associations in Large Data Sets", Gorfine et al., 2012, http://iew3.technion.ac.il/~gorfinm/files/science6.pdf).
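To make the ordering step in the quote concrete, here is a minimal sketch in plain Python. The quote speaks of "adjusted p-values" without naming a correction, so the Benjamini-Hochberg adjustment used here is my own choice for illustration, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def rank_pairs(pair_p_values):
    """Sort a {pair: p-value} dict by adjusted p-value, smallest first."""
    pairs = list(pair_p_values)
    adjusted = benjamini_hochberg([pair_p_values[pair] for pair in pairs])
    return sorted(zip(pairs, adjusted), key=lambda item: item[1])
```

Calling rank_pairs on a dictionary that maps each pair of variables to the p-value of its independence test returns the pairs with the strongest evidence of dependence first, which are the ones to study further.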

The following figure, from "A comparative study of statistical methods used to identify dependencies between gene expression signals", Santos et al., 2013 (http://www.princeton.edu/~dtakahas/publications/Brief%20Bioinform-2013-de%20Siqueira%20Santos), summarizes the types of dependencies and the methodologies suited to capturing them.


In the previous post we explored the new dependency measures MIC, dCor, and HHG on the baseball dataset used by Reshef et al. in their original 2011 paper on MIC. Following their example of analyzing the WHO datasets, here we analyze the World Bank Enterprise Surveys indicators, which are freely available on the World Bank data portal. We downloaded all indicators under the 13 available topics for all available countries of (i) East Asia & Pacific and (ii) Sub-Saharan Africa. The data covered 77 country/year observations (including Lao, Cambodia, Malaysia, Myanmar, Philippines, Thailand, and Vietnam in ASEAN) and 125 indicators. After excluding some countries and some indicators to maximize the number of indicators available, 67 country/year observations and 78 indicators remained.
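The trimming step above can be done in many ways. The sketch below shows one possible heuristic, not necessarily the procedure used for this post: it repeatedly drops whichever country/year row or indicator column has the largest share of missing values until no gaps remain. The function name and the dict-of-dicts layout (None marking a missing value) are assumptions made for illustration.

```python
def prune_to_complete(table):
    """Greedily drop rows or columns until no missing values remain.

    table maps a row name (country/year) to a dict that maps each
    column name (indicator) to a value, with None meaning missing.
    All rows are assumed to share the same columns.
    Returns the lists of kept row names and kept column names.
    """
    rows = list(table)
    cols = list(next(iter(table.values()))) if rows else []
    while rows and cols:
        # Share of missing cells in each remaining row and column.
        row_missing = {r: sum(table[r][c] is None for c in cols) / len(cols)
                       for r in rows}
        col_missing = {c: sum(table[r][c] is None for r in rows) / len(rows)
                       for c in cols}
        worst_row = max(rows, key=row_missing.get)
        worst_col = max(cols, key=col_missing.get)
        if row_missing[worst_row] == 0 and col_missing[worst_col] == 0:
            break  # the remaining block is complete
        # Drop whichever is worse; ties go to dropping the row.
        if row_missing[worst_row] >= col_missing[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols
```

A greedy rule like this does not guarantee the largest possible complete block, so it is worth checking how many countries and indicators survive before relying on it.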

To explore the dependencies between the other 77 indicators and the "Annual labor productivity growth (%)" of manufacturing firms, we ran a dependency analysis with the Pearson, MIC, dCor, and HHG methods for each pair of indicators. The results:
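For readers who want to try this kind of screening themselves, here is a minimal sketch in plain Python. It is not the code behind the results in this post: it covers only Pearson and the sample distance correlation (dCor), both written from their textbook definitions, and leaves out MIC and HHG, which need dedicated packages. The function names and the dictionary layout of the indicators are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def _centered_distances(values):
    """Matrix of |v_i - v_j|, double-centered by row, column and grand means."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]  # same as column means (symmetric)
    grand_mean = sum(row_mean) / n
    return [[dist[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation (dCor) of two equal-length sequences."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    cells = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in cells) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in cells) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in cells) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0  # at least one variable is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


def rank_against_target(target, indicators):
    """Return (name, pearson, dcor) tuples, strongest dCor first."""
    rows = [(name, pearson(values, target), distance_correlation(values, target))
            for name, values in indicators.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Passing the productivity-growth column as target and a dictionary of the remaining indicator columns gives a table sorted by dCor. An indicator with a near-zero Pearson value but a sizeable dCor is a candidate for a nonlinear relationship.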





Not being an economist, I have little to say about interpreting the results or making sense of them. To our people, I just meant to draw attention to the wonderful world of free/open-source software, to data in the public domain, and to some of the new tools that have been created for analyzing them. I am sure the more energetic ones will acquire the micro data for Myanmar from the World Bank data portal or elsewhere and go on to analyze it, adding to our skills and gaining insights to contribute to the pool of knowledge we badly need.


1 comment:

  1. In table (i), the description for variable "X5_firm_9" was incorrect. It is now corrected. Apologies.
