Sunday, July 30, 2017

Playing with microdata III: parallel sets etc



This sample of parallel sets plots gives us colorful, crisp images with clear messages to match. The favorite illustration theme seems to be the casualties of the Titanic disaster, and parallel sets spare you the effort of forming a vivid mental image of the tragic true story on your own. This is from Kosara_BeautifulVis_2010.pdf.


My attempt to visualize development priorities by stakeholder group, the theme of my last two posts, now using parallel sets, wasn't so clear:


This may be because every stakeholder respondent could choose up to three development priorities out of the given thirty-five, making the result too complex for clear visualization. Justin, in Games of Thrones Parallel Sets Data Visualization, gives his rule of thumb:


What can we do with our complex data? Maybe we could follow the lead of Markus (from my last post):


His idea is to simplify the data by taking the top 10 distinct three-response sets by number of occurrences. For this we first need to convert the dataframe of individual respondents into a grouped dataframe of distinct response triplets (r1-r2-r3) and their frequencies (n), like this:

    r1 r2 r3 n
 42 11 19 20 5
 65 19 20 21 5
  9  4  5 11 3
 61 11 20 21 3
121 18 20 34 3
 10  3  9 11 2
 23  4  5 20 2
 26  3  9 20 2
 28  2 13 20 2
 33  4 14 20 2

This resulted in the parallel sets plot:

So far so good. But the problem is that the top-10 response sets cover only 29 of the 170 respondents (17%), and the biggest n is only 5! Going further down the rows of the dataframe is useless, because we quickly reach frequencies of 1.
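That sparseness is not surprising. With up to three choices out of thirty-five priorities, the number of possible distinct response sets dwarfs our 170 respondents; a quick back-of-envelope count in R (my own check, not from the survey documentation):

```r
# distinct unordered response sets of size 1, 2, or 3 out of 35 priorities
sum(choose(35, 1:3))    # 35 + 595 + 6545 = 7175 possible sets
```

With 7,175 possible sets and only 170 respondents, most observed triplets can occur just once or twice, which is exactly what the frequency table above shows.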
This is the script for producing my two parallel sets plots:

        # parallel sets plot using ggparallel package
        # myint thann, July 28, 2017

        # import data
        library(foreign)
        x <- read.dta("myanmar_cs_fy14_datafile_with_dk_.dta")

        # ---- extract and manipulate data
        # (1) extract data on development priorities + stakeholder groups       
        xa2g1r <- x[,c(1,3:37,447)]                         
        # (2) convert factors to integer
        xa2g1r[,2:36] <- lapply(xa2g1r[,2:36], as.integer)      
        # (3) convert NAs to 0 in development priorities
        xa2g1r[,2:36][is.na(xa2g1r[,2:36])] <- 0 
        # (4) convert 2 to 0 in development priorities
        xa2g1r[,2:36][xa2g1r[,2:36]==2] <- 0
        # (5) remove persons with no response in development priorities
        xa2g1r <- xa2g1r[rowSums(xa2g1r[,2:36])>0,]         
        # ---- end extract and manipulate data

        # (6) create response data
        xa2g1r$r1 <- 0
        xa2g1r$r2 <- 0
        xa2g1r$r3 <- 0

        # (7) fill r1, r2, r3 with the column indices of each
        #     respondent's first, second, and third responses
        for (j in seq_len(nrow(xa2g1r))) {     # avoid hardcoding 170 rows
            k <- 38                            # columns 38:40 hold r1:r3
            for (i in 2:36) {
                if (xa2g1r[j, i] == 1) {
                    xa2g1r[j, k] <- i
                    k <- k + 1
                }
                if (k > 40) break              # already found three responses
            }
        }

        ###  run parallel sets plot
        #  (8) run parallel sets plot for all responses 
        library(ggparallel)
        ggparallel(vars = list("r1",  "r2",  "r3"), xa2g1r, method="angle")+
                         coord_flip()
        #  save plot
        ggsave("perset_all.png", width = 5, height = 5)

        #  run parallel sets plot for top ten most frequent responses 
        #  (9) create grouped data frame with number of respondents 
        #        for each distinct set of three responses
        xa2g1r_G <- xa2g1r[,38:40]
        xa2g1r_G <- aggregate(list(n=rep(1,nrow(xa2g1r_G))), xa2g1r_G, length)
        xa2g1r_G10 <- xa2g1r_G[order(xa2g1r_G$n, decreasing=TRUE),][1:10,]
        #  (10) run parallel sets plot for top ten responses
        ggparallel(list('r1', 'r2', 'r3'), xa2g1r_G10, weight = 'n', order = 0)
        #  save plot
        ggsave("perset_top10.png", width = 5, height = 5)
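As an aside, the loops in step (7) can be avoided. Here is one loop-free sketch on toy data (the toy indicator matrix and object names are mine, not from the survey):

```r
# toy indicator data: 4 respondents x 5 priority columns (1 = chosen)
m <- matrix(c(1, 0, 1, 0, 1,
              0, 1, 0, 0, 0,
              1, 1, 1, 0, 0,
              0, 0, 0, 1, 1), nrow = 4, byrow = TRUE)

# for each respondent, take the column indices of the first three responses
resp <- t(apply(m, 1, function(z) {
    idx <- which(z == 1)    # columns where a priority was chosen
    length(idx) <- 3        # truncate to 3, or pad with NA if fewer
    idx
}))
colnames(resp) <- c("r1", "r2", "r3")
resp[is.na(resp)] <- 0L     # match the script, which uses 0 for "no response"
```

The same idea applied to xa2g1r[, 2:36] would give the r1-r2-r3 columns in one step.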

Notes:
  • In step (7) of my script above, I used loops to fill a regular dataframe with r1, r2, and r3. Attempting to avoid loops, I needed 10 steps to arrive at the equivalent dataframe used in my last two posts, poor me. You could certainly do better.
  • Markus's code for producing the grouped data frame used the dplyr package.
    I instead used one of the three answers to Find how many times duplicated rows repeat in R data frame on Stack Overflow, by thelatemail, which uses the aggregate function (see step (9) in my R script above). Kudos to the versatility of R and the power of Q/A sites.
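I can't reproduce Markus's exact dplyr code here, but a dplyr equivalent of step (9), run on a small made-up stand-in for xa2g1r (the toy data are my own), might look like this:

```r
library(dplyr)

# toy stand-in for xa2g1r[, c("r1", "r2", "r3")]
xa <- data.frame(r1 = c(11, 19, 11,  4, 11),
                 r2 = c(19, 20, 19,  5, 20),
                 r3 = c(20, 21, 20, 11, 21))

top10 <- xa %>%
    count(r1, r2, r3, sort = TRUE) %>%    # frequency of each distinct triplet
    slice_head(n = 10)                    # keep the ten most frequent
top10
```

This produces the same kind of grouped dataframe as the aggregate-based step (9), with the triplet frequencies in a column named n.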
