Friday, January 2, 2015

Big data: hands-on correlation, old and new



The first time I heard about the maximal information coefficient (MIC) was last year, when I came across the article “'Detecting Novel Associations in Large Data Sets' — let the giants battle it out!” [http://scientificbsides.wordpress.com/2012/01/23/detecting-novel-associations-in-large-data-sets-let-the-giants-battle-it-out/]
There the blog's author draws attention to Terry Speed's enthusiastic acceptance of MIC as “a correlation for the 21st century” and, in contrast, to the comment by Noah Simon and Rob Tibshirani pointing out MIC's shortcomings, such as its "serious power deficiencies, and hence when it is used for large-scale exploratory analysis it will produce too many false positives". The latter authors recommended the distance correlation measure (dCor) of Székely & Rizzo (2009) for general use.

Two weeks ago I became curious about the outcome of this battle and looked again. That is when I came to know about the HHG measure, in addition to MIC and dCor, as well as some others.

The well-known traditional measure of dependence/independence between two random variables is the Pearson correlation coefficient, which is still widely used today. Its features are well characterized by the following illustration (Correlation, Wikipedia).
As the last row of figures shows, the Pearson correlation coefficient cannot capture the nonlinear relationships present in any of those panels (except the last figure on the right, the four independent clouds, in which there is no relationship between X and Y).

Moreover, the correlation coefficient is a summary statistic and so cannot replace individual examination of the data, as illustrated below, where each of the plots has the same correlation coefficient of 0.8! (Correlation, Wikipedia).



By visually inspecting individual scatterplots you may be able to reduce the false-positive and false-negative rates caused by the inadequacies of the Pearson correlation measure. However, this ideal situation of being able to view the scatterplots of all potential pairs of variables of interest is no longer possible with big data, where thousands of variables are measured simultaneously. In the yeast expression data analyzed in the paper by Wang et al., with 6,000 genes, there are around 18,000,000 gene pairs, and it is a daunting task to sort through that many pairs to identify those with genuine dependencies (Putting things in order, Sun and Zhao, PNAS, November 18, 2014).
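
The figure of roughly 18,000,000 is simply the number of distinct pairs that can be formed from 6,000 genes, which you can check directly in R:

choose(6000, 2)   # = 17,997,000 distinct gene pairs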

The recent trend is to develop methods that capture complex dependencies between pairs of random variables. This is because in many modern applications the dependencies of interest may not take simple forms, and the classical methods therefore cannot capture them. For example, the distance correlation coefficient (dCor) can capture nonlinear relationships, as shown below (Distance Correlation, Wikipedia).
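
As a quick illustration of the difference (a toy example of mine, not taken from the references), Pearson correlation is blind to a symmetric parabolic relationship, while dCor is not:

library(energy)                     # provides dcor()

x <- seq(-1, 1, length.out = 200)
y <- x^2                            # deterministic but purely nonlinear relationship

cor(x, y)                           # Pearson correlation: essentially 0
dcor(x, y)                          # distance correlation: clearly greater than 0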

The maximal information coefficient (MIC) is based on concepts from information theory. Mutual information measures the amount of information one variable reveals about another, for variables of any type, and does not depend on the functional form underlying the relationship. The MIC (Reshef et al., 2011) can be seen as the continuous-variable counterpart to mutual information.

Distance correlation (dCor) is a measure of association (Székely et al., 2007; Székely and Rizzo, 2009 at https://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453933) that uses the distances between observations as part of its calculation.

Heller-Heller-Gorfine (HHG) tests are a set of statistical tests of independence between two random vectors of arbitrary dimensions, given a finite sample (A consistent multivariate test of association based on ranks of distances, Heller et al, Biometrika, 2013). The arXiv version is available at: http://arxiv.org/pdf/1201.3522v3.pdf.

As for the concepts behind the old (classical) correlation measures as well as the new ones, they may not be out of reach of us small guys, as Michael A. Newton said (https://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453932) about distance correlation, for example:

Distance covariance not only provides a bona fide dependence measure, but it does so with a simplicity to satisfy Don Geman’s elevator test (i.e., a method must be sufficiently simple that it can be explained to a colleague in the time it takes to go between floors on an elevator!).

The theories behind these new correlation measures are far too deep for me. However, that should not discourage you from trying them out, and as usual you can find appropriate R packages implementing them. For MIC you can use the function mine() in the minerva package; for distance correlation, the function dcor() in the energy package; and for HHG, the function hhg.test() in the HHG package.
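
To make the function calls concrete, here is a minimal sketch of computing all three measures for a single pair of variables (my own toy data; the component names sum.chisq and perm.pval.hhg.sc follow my reading of the HHG package output):

library(minerva)   # mine()     -> MIC
library(energy)    # dcor()     -> distance correlation
library(HHG)       # hhg.test() -> HHG test of independence

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)   # a noisy nonlinear relationship

mine(x, y)$MIC     # MIC
dcor(x, y)         # distance correlation

# HHG works on pairwise distance matrices and uses permutations for p-values
Dx <- as.matrix(dist(x))
Dy <- as.matrix(dist(y))
res <- hhg.test(Dx, Dy, nr.perm = 1000)
res$sum.chisq            # the HHG test statistic reported in the tables below
res$perm.pval.hhg.sc     # its permutation p-value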

I looked for and found my model for running these tests in the post "Maximal Information Coefficient (Part II)" of Wednesday, September 17, 2014 in the "me nugget" blog at: http://menugget.blogspot.com/2014/09/maximal-information-coefficient-part-ii.html#more.

The code provided in that blog implemented the MIC and Pearson correlations for the baseball data set used in Reshef et al.'s original 2011 article on MIC. There, 130 variables were correlated against a baseball player's salary using the MLB2008.csv data set available at
http://www.exploredata.net/Downloads/Baseball-Data-Set.

I extended the analysis to run the dCor and HHG tests as well; a sketch of the extension is shown below, and the results follow.
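
(In this sketch the salary column name "Salary", the ranking by absolute Pearson correlation, and the number of permutations are my assumptions; the exact setup is in the code from the "me nugget" post.)

library(minerva); library(energy); library(HHG)

dat    <- read.csv("MLB2008.csv")
salary <- dat$Salary                          # assumed name of the response column
num    <- names(dat)[sapply(dat, is.numeric)] # keep only numeric predictors
vars   <- setdiff(num, "Salary")

Dy <- as.matrix(dist(salary))                 # salary distance matrix, reused by hhg.test()

res <- t(sapply(vars, function(v) {
  x  <- dat[[v]]
  Dx <- as.matrix(dist(x))
  h  <- hhg.test(Dx, Dy, nr.perm = 1000)      # permutations kept small; this loop is slow
  c(MIC     = as.numeric(mine(x, salary)$MIC),
    Pearson = cor(x, salary),
    dCor    = dcor(x, salary),
    HHG     = h$sum.chisq,
    HHG_perm.pval.hhg.sc = h$perm.pval.hhg.sc)
}))

res <- as.data.frame(res)
res$MIC_Rank     <- rank(-res$MIC)
res$Pearson_Rank <- rank(-abs(res$Pearson))   # ranking by |r| is my assumption
res$dCor_Rank    <- rank(-res$dCor)
res$HHG_Rank     <- rank(-res$HHG)

head(res[order(res$MIC_Rank), ], 10)          # top 10 by MIC, as in table (i) below
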
(i) Comparison with top 10 ranking MIC coefficients

              MIC MIC_Rank   Pearson Pearson_Rank      dCor dCor_Rank      HHG HHG_Rank HHG_perm.pval.hhg.sc
D_RPMLV 0.3688595        1 0.3569901           14 0.3353516        18 588103.5       30         0.0009950249
H       0.3665573        2 0.3162080           37 0.3070682        39 564774.9       36         0.0009950249
TB      0.3613143        3 0.3482234           20 0.3376913        16 698983.3       10         0.0009950249
PA      0.3599480        4 0.3239600           31 0.3227682        25 656445.1       19         0.0009950249
BALLS   0.3559231        5 0.3686719            8 0.3595044         4 732985.0        7         0.0009950249
LD      0.3540088        6 0.3078039           40 0.3076449        36 539350.6       44         0.0009950249
PA.     0.3498458        7 0.3231203           32 0.3219160        27 644955.3       21         0.0009950249
TOB     0.3485658        8 0.3681359            9 0.3530406         6 729775.3        8         0.0009950249
FB      0.3462294        9 0.2848727           52 0.3117478        34 584117.7       32         0.0009950249
STRIKES 0.3450615       10 0.3059649           41 0.3096172        35 599453.5       27         0.0009950249


(ii) Comparison with top 10 ranking dCor coefficients

            MIC MIC_Rank   Pearson Pearson_Rank      dCor dCor_Rank      HHG HHG_Rank HHG_perm.pval.hhg.sc
BB    0.3443067       11 0.4042573            1 0.3911814         1 790049.7        4         0.0009950249
IBB   0.2805604       82 0.4033611            2 0.3785360         2 812153.9        3         0.0009950249
UBB   0.3425437       13 0.3706693            6 0.3674838         3 707229.5        9         0.0009950249
BALLS 0.3559231        5 0.3686719            8 0.3595044         4 732985.0        7         0.0009950249
D_RAR 0.3335061       23 0.3771942            4 0.3532652         5 654073.9       20         0.0009950249
TOB   0.3485658        8 0.3681359            9 0.3530406         6 729775.3        8         0.0009950249
RBI   0.3142433       45 0.3824583            3 0.3529999         7 639921.9       23         0.0009950249
D_EqR 0.3376202       20 0.3679313           10 0.3470629         8 696841.2       11         0.0009950249
DP    0.3206635       38 0.3613956           12 0.3461126         9 603111.2       26         0.0009950249
R1_BI 0.3025373       59 0.3741631            5 0.3431456        10 562235.0       39         0.0009950249

(iii) Comparison with top 10 ranking HHG coefficients

            MIC MIC_Rank    Pearson Pearson_Rank      dCor dCor_Rank      HHG HHG_Rank HHG_perm.pval.hhg.sc
PA_DH 0.2839103       80  0.2653424           61 0.2927642        50 933967.9        1         0.0009950249
G_PR  0.2941879       70 -0.2102295           72 0.2639014        63 882670.1        2         0.0009950249
IBB   0.2805604       82  0.4033611            2 0.3785360         2 812153.9        3         0.0009950249
BB    0.3443067       11  0.4042573            1 0.3911814         1 790049.7        4         0.0009950249
G_DH  0.2839103       79  0.2482047           63 0.2705099        62 774766.0        5         0.0009950249
SHR   0.3012459       62 -0.2777007           57 0.3128214        33 758804.8        6         0.0009950249
BALLS 0.3559231        5  0.3686719            8 0.3595044         4 732985.0        7         0.0009950249
TOB   0.3485658        8  0.3681359            9 0.3530406         6 729775.3        8         0.0009950249
UBB   0.3425437       13  0.3706693            6 0.3674838         3 707229.5        9         0.0009950249
TB    0.3613143        3  0.3482234           20 0.3376913        16 698983.3       10         0.0009950249




To get a feel for how the new correlation analyses handle nonlinear associations, I created and used the nonlinear relationships described by M. A. Newton as test data. Each can be generated by the function hhg.example.datagen() in the HHG package of the R statistical environment.
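
As a sketch of how such test data can be generated (the example name '4indclouds' is one of those listed in the package documentation; names and the exact output format may differ between package versions):

library(HHG)

set.seed(1)
d <- hhg.example.datagen(250, "4indclouds")   # I take the output to be a two-row matrix of x and y values
x <- d[1, ]
y <- d[2, ]
plot(x, y, main = "4indclouds")

# x and y can then be fed to mine(), dcor() and hhg.test() exactly as above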





