I first heard about the maximal information coefficient (MIC) last year, when I came across the blog post “'Detecting Novel Associations in Large Data Sets' — let the giants battle it out!” [http://scientificbsides.wordpress.com/2012/01/23/detecting-novel-associations-in-large-data-sets-let-the-giants-battle-it-out/]. There the author draws attention to Terry Speed's enthusiastic endorsement of MIC as “a correlation for the 21st century” and, in contrast, to a comment by Noah Simon and Rob Tibshirani pointing out MIC's shortcomings, such as its "serious power deficiencies, and hence when it is used for large-scale exploratory analysis it will produce too many false positives". The latter recommended the distance correlation measure (dCor) of Székely and Rizzo (2009) for general use.
Two weeks ago, curious about how this battle had turned out, I looked again. That was when I came to know about the HHG measure, in addition to MIC and dCor, as well as some others.
The well-known traditional measure of dependence between two random variables is the Pearson correlation coefficient, which is still widely used today. Its features are well characterized by the following illustration (Correlation, Wikipedia).
As the last row of figures shows, the Pearson correlation coefficient cannot capture the nonlinear relationships present in any of them (except the last figure on the right, the four independent clouds, where there is indeed no relationship between X and Y).
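This failure is easy to reproduce. Below is a minimal sketch in base R: y is completely determined by x, yet because the relationship is symmetric rather than monotone, the Pearson coefficient comes out essentially zero.

set.seed(1)
x <- runif(1000, -1, 1)   # symmetric around zero
y <- x^2                  # y is fully determined by x
cor(x, y)                 # near 0: Pearson sees almost no relationship
cor(x, 2 * x + 1)         # exactly 1: a linear relationship is captured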
Moreover, the correlation coefficient is a summary statistic, and so cannot replace individual examination of the data, as illustrated below, where each of the individual plots has the same correlation coefficient of about 0.8! (Correlation, Wikipedia).
By visually inspecting individual scatterplots you may be able to reduce the false-positive and false-negative rates due to the inadequacies of the Pearson correlation measure. However, this ideal situation of being able to view the scatterplots of all the potential pairs of variables of interest is no longer possible with big data, where thousands of variables are measured simultaneously. In the yeast expression data analyzed in the paper by Wang et al., with 6,000 genes, there are around 18,000,000 gene pairs, and it is a daunting task to sort through that many pairs to identify those having genuine dependencies (Putting things in order, Sun and Zhao, PNAS, November 18, 2014).
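The pair count is simple combinatorics, checked in one line of R:

choose(6000, 2)   # = 17,997,000, i.e., around 18 million gene pairs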
The recent trend is toward methods that capture complex dependencies between pairs of random variables, because in many modern applications the dependencies of interest may not be of simple form, and the classical methods therefore cannot capture them. For example, the distance correlation coefficient (dCor) can capture nonlinear relationships, as shown below (Distance Correlation, Wikipedia).
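To see this on the toy example from above, here is a sketch using the dcor() function from the energy package (the exact value printed is data-dependent; the point is only that it is clearly nonzero where Pearson was near zero):

library(energy)   # install.packages("energy") if needed

set.seed(1)
x <- runif(1000, -1, 1)
y <- x^2
cor(x, y)    # Pearson: near 0
dcor(x, y)   # distance correlation: clearly nonzero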
The maximal information coefficient (MIC) is based on concepts from information theory. Mutual information measures the amount of information one variable reveals about another, for variables of any type, and it does not depend on the functional form underlying the relationship. MIC (Reshef et al., 2011) can be seen as the continuous-variable counterpart of mutual information.
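A sketch of computing MIC with the mine() function from the minerva package; mine() returns several statistics, of which MIC is the one of interest here. A noiseless functional relationship should give a MIC close to 1:

library(minerva)   # install.packages("minerva") if needed

set.seed(1)
x <- runif(500, -1, 1)
mine(x, x^2)$MIC          # close to 1: noiseless functional relationship
mine(x, rnorm(500))$MIC   # much smaller: independent variables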
Distance correlation (dCor) is a measure of association (Székely et al., 2007; Székely and Rizzo, 2009, https://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453933) that uses the distances between observations as part of its calculation.
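To make "uses the distances between observations" concrete, here is an illustrative from-scratch sketch of the sample distance correlation for univariate x and y, following the definition in Székely et al. (2007): form the pairwise distance matrices, double-center them, and combine the centered entries. In practice you would simply call dcor() from the energy package.

dcor_manual <- function(x, y) {
  A <- as.matrix(dist(x))   # pairwise distances |x_j - x_k|
  B <- as.matrix(dist(y))
  # double-center: subtract row and column means, add back the grand mean
  center <- function(D) D - rowMeans(D)[row(D)] - colMeans(D)[col(D)] + mean(D)
  A <- center(A); B <- center(B)
  dcov2  <- mean(A * B)     # squared sample distance covariance
  dvar2x <- mean(A * A)     # squared sample distance variances
  dvar2y <- mean(B * B)
  sqrt(dcov2 / sqrt(dvar2x * dvar2y))
}

set.seed(1)
x <- runif(200, -1, 1)
dcor_manual(x, x^2)   # should agree with energy::dcor(x, x^2)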
Heller-Heller-Gorfine (HHG) tests are a set of statistical tests of independence between two random vectors of arbitrary dimensions, given a finite sample (A consistent multivariate test of association based on ranks of distances, Heller et al., Biometrika, 2013). The arXiv version is available at http://arxiv.org/pdf/1201.3522v3.pdf.
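A sketch of calling hhg.test() from the HHG package. Note that it takes pairwise distance matrices rather than the raw variables; the permutation p-value of the sum-of-chi-squares statistic is the perm.pval.hhg.sc value that appears in the tables below:

library(HHG)   # install.packages("HHG") if needed

set.seed(1)
x <- runif(200, -1, 1)
y <- x^2

Dx <- as.matrix(dist(x))   # the test operates on distance matrices
Dy <- as.matrix(dist(y))
res <- hhg.test(Dx, Dy, nr.perm = 1000)
res$sum.chisq          # sum-of-chi-squares test statistic
res$perm.pval.hhg.sc   # its permutation p-value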
As for the concepts behind the classical correlation measures as well as the new ones, they may not be out of reach of us small guys, as Michael A. Newton suggested (https://projecteuclid.org/download/pdfview_1/euclid.aoas/1267453932) about distance correlation, for example:
Distance covariance not only provides a bona
fide dependence measure, but it does so with a
simplicity to satisfy Don Geman’s elevator test (i.e.,
a method must be sufficiently simple that it can be explained to a colleague in
the time it takes to go between floors on an elevator!).
The theories behind these new correlation measures are far too deep for me. However, you should not be discouraged from trying them out, and as usual you can find appropriate R packages implementing them. For MIC you can use the function mine() in the minerva package; for distance correlation, the function dcor() in the energy package; and for HHG, the function hhg.test() in the HHG package.
I found my model for running these tests in the post "Maximal Information Coefficient (Part II)" of Wednesday, September 17, 2014, on the "me nugget" blog: http://menugget.blogspot.com/2014/09/maximal-information-coefficient-part-ii.html#more. The code provided in that post computed the MIC and Pearson correlations for the baseball data set used in Reshef et al.'s original 2011 article on MIC. There, 130 variables were correlated against a baseball player's salary, from the MLB2008.csv data set available at http://www.exploredata.net/Downloads/Baseball-Data-Set.
I extended the analysis to run the dCor and HHG tests as well, roughly along the lines of the sketch below.
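This sketch assumes the MLB2008.csv layout used in the me-nugget post; the name of the salary column ("SALARY" here) and the use of 1,000 permutations are my assumptions, and the HHG permutation step is by far the slowest part.

library(minerva); library(energy); library(HHG)

d    <- read.csv("MLB2008.csv")
y    <- d$SALARY                  # response: player salary (column name assumed)
vars <- setdiff(names(d), "SALARY")
Dy   <- as.matrix(dist(y))

res <- t(sapply(vars, function(v) {
  x <- d[[v]]
  h <- hhg.test(as.matrix(dist(x)), Dy, nr.perm = 1000)
  c(MIC     = mine(x, y)$MIC,
    Pearson = cor(x, y),
    dCor    = dcor(x, y),
    HHG     = h$sum.chisq,
    HHG_p   = h$perm.pval.hhg.sc)
}))
# per-measure ranks can then be added with rank(-res[, "MIC"]), etc.

The results are: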
(i) Comparison with top 10 ranking MIC coefficients
        MIC        MIC_Rank  Pearson    Pearson_Rank  dCor       dCor_Rank  HHG       HHG_Rank  HHG_perm.pval.hhg.sc
D_RPMLV 0.3688595   1        0.3569901  14            0.3353516  18         588103.5  30        0.0009950249
H       0.3665573   2        0.3162080  37            0.3070682  39         564774.9  36        0.0009950249
TB      0.3613143   3        0.3482234  20            0.3376913  16         698983.3  10        0.0009950249
PA      0.3599480   4        0.3239600  31            0.3227682  25         656445.1  19        0.0009950249
BALLS   0.3559231   5        0.3686719   8            0.3595044   4         732985.0   7        0.0009950249
LD      0.3540088   6        0.3078039  40            0.3076449  36         539350.6  44        0.0009950249
PA.     0.3498458   7        0.3231203  32            0.3219160  27         644955.3  21        0.0009950249
TOB     0.3485658   8        0.3681359   9            0.3530406   6         729775.3   8        0.0009950249
FB      0.3462294   9        0.2848727  52            0.3117478  34         584117.7  32        0.0009950249
STRIKES 0.3450615  10        0.3059649  41            0.3096172  35         599453.5  27        0.0009950249
(ii) Comparison with top 10 ranking dCor coefficients
      MIC        MIC_Rank  Pearson    Pearson_Rank  dCor       dCor_Rank  HHG       HHG_Rank  HHG_perm.pval.hhg.sc
BB    0.3443067  11        0.4042573   1            0.3911814   1         790049.7   4        0.0009950249
IBB   0.2805604  82        0.4033611   2            0.3785360   2         812153.9   3        0.0009950249
UBB   0.3425437  13        0.3706693   6            0.3674838   3         707229.5   9        0.0009950249
BALLS 0.3559231   5        0.3686719   8            0.3595044   4         732985.0   7        0.0009950249
D_RAR 0.3335061  23        0.3771942   4            0.3532652   5         654073.9  20        0.0009950249
TOB   0.3485658   8        0.3681359   9            0.3530406   6         729775.3   8        0.0009950249
RBI   0.3142433  45        0.3824583   3            0.3529999   7         639921.9  23        0.0009950249
D_EqR 0.3376202  20        0.3679313  10            0.3470629   8         696841.2  11        0.0009950249
DP    0.3206635  38        0.3613956  12            0.3461126   9         603111.2  26        0.0009950249
R1_BI 0.3025373  59        0.3741631   5            0.3431456  10         562235.0  39        0.0009950249
(iii) Comparison with top 10 ranking HHG statistics
      MIC        MIC_Rank  Pearson     Pearson_Rank  dCor       dCor_Rank  HHG       HHG_Rank  HHG_perm.pval.hhg.sc
PA_DH 0.2839103  80         0.2653424  61            0.2927642  50         933967.9   1        0.0009950249
G_PR  0.2941879  70        -0.2102295  72            0.2639014  63         882670.1   2        0.0009950249
IBB   0.2805604  82         0.4033611   2            0.3785360   2         812153.9   3        0.0009950249
BB    0.3443067  11         0.4042573   1            0.3911814   1         790049.7   4        0.0009950249
G_DH  0.2839103  79         0.2482047  63            0.2705099  62         774766.0   5        0.0009950249
SHR   0.3012459  62        -0.2777007  57            0.3128214  33         758804.8   6        0.0009950249
BALLS 0.3559231   5         0.3686719   8            0.3595044   4         732985.0   7        0.0009950249
TOB   0.3485658   8         0.3681359   9            0.3530406   6         729775.3   8        0.0009950249
UBB   0.3425437  13         0.3706693   6            0.3674838   3         707229.5   9        0.0009950249
TB    0.3613143   3         0.3482234  20            0.3376913  16         698983.3  10        0.0009950249
To get a feel for how the new correlation analyses work with nonlinear associations, I created and used as test data the nonlinear relationships described by M. A. Newton. Each can be generated by the function hhg.example.datagen() in the HHG package of the R statistical environment.
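As a sketch (the example names follow the HHG package documentation, and I am assuming the generator returns a two-row matrix with x in the first row and y in the second, as in the package examples), each relationship can be generated and scored like this:

library(HHG); library(minerva); library(energy)

set.seed(1)
for (ex in c("W", "Parabola", "Circle", "2Parabolas", "4indclouds")) {
  X <- hhg.example.datagen(200, ex)   # 2 x n matrix: x in row 1, y in row 2
  x <- X[1, ]; y <- X[2, ]
  cat(sprintf("%-11s Pearson = % .3f  dCor = %.3f  MIC = %.3f\n",
              ex, cor(x, y), dcor(x, y), mine(x, y)$MIC))
}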