Friday, January 2, 2015

Big data: hands-on correlation, old and new



The first time I heard about the maximal information coefficient (MIC) was last year, when I came across the article “'Detecting Novel Associations in Large Data Sets' — let the giants battle it out!” [http://scientificbsides.wordpress.com/2012/01/23/detecting-novel-associations-in-large-data-sets-let-the-giants-battle-it-out/]
There the blog's author draws attention to Terry Speed's enthusiastic acceptance of MIC as “a correlation for the 21st century” and, in contrast, to the comment by Noah Simon and Rob Tibshirani pointing out MIC's shortcomings, such as its "serious power deficiencies, and hence when it is used for large-scale exploratory analysis it will produce too many false positives". The latter authors recommended the distance correlation measure (dCor) of Székely & Rizzo (2009) for general use.

Two weeks ago I became curious about the outcome of this battle and looked again. That is when I came to know about the HHG measure, in addition to MIC and dCor, as well as some others.

The well-known traditional measure of dependence/independence between two random variables is the Pearson correlation coefficient, which is still widely used today. Its features are well characterized by the following illustration (Correlation, Wikipedia).
As the last row of figures shows, the Pearson correlation coefficient cannot capture the nonlinear relationships present in any of those panels (except the last figure on the right, the four independent clouds, in which there is no relationship between X and Y).

Moreover, the correlation coefficient is a summary statistic and so cannot replace individual examination of the data, as illustrated below, where each of the plots has the same correlation coefficient of 0.8! (Correlation, Wikipedia).



By visually inspecting individual scatterplots you may be able to reduce the false-positive and false-negative rates caused by the inadequacies of the Pearson correlation measure. However, this ideal situation of being able to view the scatterplots of all potential pairs of variables of interest is no longer possible with big data, where thousands of variables are measured simultaneously. In the yeast expression data analyzed in the paper by Wang et al., with 6,000 genes, there are around 18,000,000 gene pairs, and it is a daunting task to sort through that many pairs to identify those with genuine dependencies (Putting things in order, Sun and Zhao, PNAS, November 18, 2014).
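
The figure of roughly 18,000,000 is simply the number of distinct pairs that can be formed from 6,000 genes, which you can check directly in R:

choose(6000, 2)   # = 17,997,000 distinct gene pairs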

The recent trend is to develop methods that capture complex dependencies between pairs of random variables. This is because in many modern applications the dependencies of interest may not take simple forms, and the classical methods therefore cannot capture them. For example, the distance correlation coefficient (dCor) can capture nonlinear relationships, as shown below (Distance Correlation, Wikipedia).
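
As a quick illustration of the difference (a toy example of mine, not taken from the references), Pearson correlation is blind to a symmetric parabolic relationship, while dCor is not:

library(energy)                     # provides dcor()

x <- seq(-1, 1, length.out = 200)
y <- x^2                            # deterministic but purely nonlinear relationship

cor(x, y)                           # Pearson correlation: essentially 0
dcor(x, y)                          # distance correlation: clearly greater than 0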

The maximal information coefficient (MIC) is based on concepts from information theory. Mutual information measures the amount of information one variable reveals about another, for variables of any type, and does not depend on the functional form underlying the relationship. The MIC (Reshef et al., 2011) can be seen as the continuous-variable counterpart to mutual information.

Distance correlation (dCor) is a measure of association (Székely et al., 2007; Székely and Rizzo, 2009 at https://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453933) that uses the distances between observations as part of its calculation.

Heller-Heller-Gorfine (HHG) tests are a set of statistical tests of independence between two random vectors of arbitrary dimensions, given a finite sample (A consistent multivariate test of association based on ranks of distances, Heller et al, Biometrika, 2013). The arXiv version is available at: http://arxiv.org/pdf/1201.3522v3.pdf.

As for the concepts behind the old (classical) correlation measures as well as the new ones, they may not be out of reach of us small guys, as Michael A. Newton said (https://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453932) about distance correlation, for example:

Distance covariance not only provides a bona fide dependence measure, but it does so with a simplicity to satisfy Don Geman’s elevator test (i.e., a method must be sufficiently simple that it can be explained to a colleague in the time it takes to go between floors on an elevator!).

The theories behind these new correlation measures are far too deep for me. However, that should not discourage you from trying them out, and as usual you can find appropriate R packages implementing them. For MIC you can use the function mine() in the minerva package; for distance correlation, the function dcor() in the energy package; and for HHG, the function hhg.test() in the HHG package.
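
To make the function calls concrete, here is a minimal sketch of computing all three measures for a single pair of variables (my own toy data; the component names sum.chisq and perm.pval.hhg.sc follow my reading of the HHG package output):

library(minerva)   # mine()     -> MIC
library(energy)    # dcor()     -> distance correlation
library(HHG)       # hhg.test() -> HHG test of independence

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)   # a noisy nonlinear relationship

mine(x, y)$MIC     # MIC
dcor(x, y)         # distance correlation

# HHG works on pairwise distance matrices and uses permutations for p-values
Dx <- as.matrix(dist(x))
Dy <- as.matrix(dist(y))
res <- hhg.test(Dx, Dy, nr.perm = 1000)
res$sum.chisq            # the HHG test statistic reported in the tables below
res$perm.pval.hhg.sc     # its permutation p-value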

I looked for and found my model for running these tests in the post "Maximal Information Coefficient (Part II)" of Wednesday, September 17, 2014 in the "me nugget" blog at: http://menugget.blogspot.com/2014/09/maximal-information-coefficient-part-ii.html#more.

The code provided in that blog implemented the MIC and Pearson correlations for the baseball data set used in Reshef et al.'s original 2011 article on MIC. There, 130 variables were correlated against a baseball player's salary using the MLB2008.csv data set available at
http://www.exploredata.net/Downloads/Baseball-Data-Set.

I extended the analysis to run the dCor and HHG tests as well; a sketch of the extension is shown below, and the results follow.
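
(In this sketch the salary column name "Salary", the ranking by absolute Pearson correlation, and the number of permutations are my assumptions; the exact setup is in the code from the "me nugget" post.)

library(minerva); library(energy); library(HHG)

dat    <- read.csv("MLB2008.csv")
salary <- dat$Salary                          # assumed name of the response column
num    <- names(dat)[sapply(dat, is.numeric)] # keep only numeric predictors
vars   <- setdiff(num, "Salary")

Dy <- as.matrix(dist(salary))                 # salary distance matrix, reused by hhg.test()

res <- t(sapply(vars, function(v) {
  x  <- dat[[v]]
  Dx <- as.matrix(dist(x))
  h  <- hhg.test(Dx, Dy, nr.perm = 1000)      # permutations kept small; this loop is slow
  c(MIC     = as.numeric(mine(x, salary)$MIC),
    Pearson = cor(x, salary),
    dCor    = dcor(x, salary),
    HHG     = h$sum.chisq,
    HHG_perm.pval.hhg.sc = h$perm.pval.hhg.sc)
}))

res <- as.data.frame(res)
res$MIC_Rank     <- rank(-res$MIC)
res$Pearson_Rank <- rank(-abs(res$Pearson))   # ranking by |r| is my assumption
res$dCor_Rank    <- rank(-res$dCor)
res$HHG_Rank     <- rank(-res$HHG)

head(res[order(res$MIC_Rank), ], 10)          # top 10 by MIC, as in table (i) below
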
(i) Comparison with top 10 ranking MIC coefficients

              MIC MIC_Rank   Pearson Pearson_Rank      dCor dCor_Rank      HHG HHG_Rank HHG_perm.pval.hhg.sc
D_RPMLV 0.3688595        1 0.3569901           14 0.3353516        18 588103.5       30         0.0009950249
H       0.3665573        2 0.3162080           37 0.3070682        39 564774.9       36         0.0009950249
TB      0.3613143        3 0.3482234           20 0.3376913        16 698983.3       10         0.0009950249
PA      0.3599480        4 0.3239600           31 0.3227682        25 656445.1       19         0.0009950249
BALLS   0.3559231        5 0.3686719            8 0.3595044         4 732985.0        7         0.0009950249
LD      0.3540088        6 0.3078039           40 0.3076449        36 539350.6       44         0.0009950249
PA.     0.3498458        7 0.3231203           32 0.3219160        27 644955.3       21         0.0009950249
TOB     0.3485658        8 0.3681359            9 0.3530406         6 729775.3        8         0.0009950249
FB      0.3462294        9 0.2848727           52 0.3117478        34 584117.7       32         0.0009950249
STRIKES 0.3450615       10 0.3059649           41 0.3096172        35 599453.5       27         0.0009950249


(ii) Comparison with top 10 ranking dCor coefficients

            MIC MIC_Rank   Pearson Pearson_Rank      dCor dCor_Rank      HHG HHG_Rank HHG_perm.pval.hhg.sc
BB    0.3443067       11 0.4042573            1 0.3911814         1 790049.7        4         0.0009950249
IBB   0.2805604       82 0.4033611            2 0.3785360         2 812153.9        3         0.0009950249
UBB   0.3425437       13 0.3706693            6 0.3674838         3 707229.5        9         0.0009950249
BALLS 0.3559231        5 0.3686719            8 0.3595044         4 732985.0        7         0.0009950249
D_RAR 0.3335061       23 0.3771942            4 0.3532652         5 654073.9       20         0.0009950249
TOB   0.3485658        8 0.3681359            9 0.3530406         6 729775.3        8         0.0009950249
RBI   0.3142433       45 0.3824583            3 0.3529999         7 639921.9       23         0.0009950249
D_EqR 0.3376202       20 0.3679313           10 0.3470629         8 696841.2       11         0.0009950249
DP    0.3206635       38 0.3613956           12 0.3461126         9 603111.2       26         0.0009950249
R1_BI 0.3025373       59 0.3741631            5 0.3431456        10 562235.0       39         0.0009950249

(iii) Comparison with top 10 ranking HHG coefficients

            MIC MIC_Rank    Pearson Pearson_Rank      dCor dCor_Rank      HHG HHG_Rank HHG_perm.pval.hhg.sc
PA_DH 0.2839103       80  0.2653424           61 0.2927642        50 933967.9        1         0.0009950249
G_PR  0.2941879       70 -0.2102295           72 0.2639014        63 882670.1        2         0.0009950249
IBB   0.2805604       82  0.4033611            2 0.3785360         2 812153.9        3         0.0009950249
BB    0.3443067       11  0.4042573            1 0.3911814         1 790049.7        4         0.0009950249
G_DH  0.2839103       79  0.2482047           63 0.2705099        62 774766.0        5         0.0009950249
SHR   0.3012459       62 -0.2777007           57 0.3128214        33 758804.8        6         0.0009950249
BALLS 0.3559231        5  0.3686719            8 0.3595044         4 732985.0        7         0.0009950249
TOB   0.3485658        8  0.3681359            9 0.3530406         6 729775.3        8         0.0009950249
UBB   0.3425437       13  0.3706693            6 0.3674838         3 707229.5        9         0.0009950249
TB    0.3613143        3  0.3482234           20 0.3376913        16 698983.3       10         0.0009950249




To get a feel for how the new correlation analyses handle nonlinear associations, I created and used the nonlinear relationships described by M. A. Newton as test data. Each can be generated by the function hhg.example.datagen() in the HHG package of the R statistical environment.
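
As a sketch of how such test data can be generated (the example name '4indclouds' is one of those listed in the package documentation; names and the exact output format may differ between package versions):

library(HHG)

set.seed(1)
d <- hhg.example.datagen(250, "4indclouds")   # I take the output to be a two-row matrix of x and y values
x <- d[1, ]
y <- d[2, ]
plot(x, y, main = "4indclouds")

# x and y can then be fed to mine(), dcor() and hhg.test() exactly as above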





