Sunday, July 30, 2017

Playing with microdata III: parallel sets etc



These sample parallel sets plots give us colorful, crisp images with clear messages to match. The favorite theme for illustration seems to be the casualties of the Titanic disaster and, as far as I can see, parallel sets spare you the effort of creating a vivid mental image of the tragic true story. They are from Kosara_BeautifulVis_2010.pdf.


My attempt at visualizing development priorities by stakeholder group, the theme of my last two posts, this time using parallel sets, wasn't so clear:


This may be because every respondent could choose up to three development priorities out of the thirty-five given, making the result too complex for clear visualization. Justin, in Games of Thrones Parallel Sets Data Visualization, gives his rule of thumb:


What can we do with our complex data? Maybe we could follow the lead of Markus (from my last post):


His idea is to simplify the data by taking the top 10 distinct three-response sets by number of occurrences. For this we first need to convert the data frame of individual respondents into a grouped data frame of distinct response triplets (r1-r2-r3) and their frequencies (n), like this:

    r1 r2 r3 n
42  11 19 20 5
65  19 20 21 5
9    4  5 11 3
61  11 20 21 3
121 18 20 34 3
10   3  9 11 2
23   4  5 20 2
26   3  9 20 2
28   2 13 20 2
33   4 14 20 2

This resulted in the parallel sets plot:

So far so good. But the problem is that the top 10 response triplets cover only 29 of the 170 respondents (17%), and the biggest n is only 5! Also, going further down the rows of the data frame is useless because we'll be getting down to frequencies of 1.
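The coverage figure is easy to check from the table above. The snippet below is only a sketch: it re-types the ten frequencies by hand (the names n_top10, covered, share and n_triplets are mine, not from the original script) and also counts how many distinct unordered triplets are possible with 35 items, which hints at why most triplets occur only once or twice among 170 respondents.

```r
# frequencies n of the top 10 triplets, copied by hand from the table above
n_top10 <- c(5, 5, 3, 3, 3, 2, 2, 2, 2, 2)

# respondents covered by the top 10 triplets, and their share of the 170
covered <- sum(n_top10)
share <- round(100 * covered / 170)

# number of distinct unordered triplets that can be picked from 35 items
n_triplets <- choose(35, 3)

print(c(covered = covered, share_pct = share, possible_triplets = n_triplets))
```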
This is the script for producing my two parallel sets plots:

        # parallel sets plot using ggparallel package
        # myint thann, July 28, 2017

        # import data
        library(foreign)
        x <- read.dta("myanmar_cs_fy14_datafile_with_dk_.dta")

        # ---- extract and manipulate data
        # (1) extract data on development priorities + stakeholder groups       
        xa2g1r <- x[,c(1,3:37,447)]                         
        # (2) convert factors to integer
        xa2g1r[,2:36] <- lapply(xa2g1r[,2:36], as.integer)      
        # (3) convert NAs to 0 in development priorities
        xa2g1r[,2:36][is.na(xa2g1r[,2:36])] <- 0 
        # (4) convert 2 to 0 in development priorities
        xa2g1r[,2:36][xa2g1r[,2:36]==2] <- 0
        # (5) remove persons with no response in development priorities
        xa2g1r <- xa2g1r[rowSums(xa2g1r[,2:36])>0,]         
        # ---- end extract and manipulate data

        # (6) create response data
        xa2g1r$r1 <- 0
        xa2g1r$r2 <- 0
        xa2g1r$r3 <- 0

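        # (7) create first, second, and third responses
        #     note: the value stored is the column index i (2..36) of xa2g1r,
        #     i.e. the item's position within columns 2:36 plus 1
        for (j in seq_len(nrow(xa2g1r))) {      # 170 respondents in this data
            k <- 38                             # column of r1 (r2 = 39, r3 = 40)
            for (i in 2:36) {
                if (xa2g1r[j,i]==1) {
                    xa2g1r[j,k] <- i
                    k <- k+1
                }
                if (k > 40) break               # keep at most three responses
            }
        }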

        ###  run parallel sets plot
        #  (8) run parallel sets plot for all responses 
        library(ggparallel)
        ggparallel(vars = list("r1",  "r2",  "r3"), xa2g1r, method="angle")+
                         coord_flip()
        #  save plot
        ggsave("perset_all.png", width = 5, height = 5)

        #  run parallel sets plot for top ten most frequent responses 
        #  (9) create grouped data frame with number of respondents 
        #        for each distinct set of three responses
        xa2g1r_G <- xa2g1r[,38:40]
        xa2g1r_G <- aggregate(list(n=rep(1,nrow(xa2g1r_G))), xa2g1r_G, length)
        xa2g1r_G10 <- xa2g1r_G[order(xa2g1r_G$n, decreasing=TRUE),][1:10,]
        #  (10) run parallel sets plot for top ten responses
        ggparallel(list('r1', 'r2', 'r3'), xa2g1r_G10, weight = 'n', order = 0)
        #  save plot
        ggsave("perset_top10.png", width = 5, height = 5)

Notes:
  • In step (7) of my script above, I used loops to create the r1, r2, and r3 columns of a regular data frame. Attempting to avoid loops, I had needed 10 steps to arrive at the equivalent data frame used in my last two posts, poor me. You could certainly do better than that.
  • Markus's code for producing the grouped data frame used the dplyr package:
    Instead, I used one of the three answers to "Find how many times duplicated rows repeat in R data frame" on Stack Overflow, the one by thelatemail that uses the aggregate function (see step (9) in my R script above). Kudos to the versatility of R and the power of Q&A sites.
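For what it's worth, here is one loop-free way the r1, r2, r3 columns and the grouped counts might be produced in base R. It is only a sketch on made-up 0/1 data (five respondents, six items); the names m, first3, rdf and g are mine, and this is neither Markus's dplyr code nor the script above. Unlike step (7), it stores the item number itself rather than the column index.

```r
# made-up 0/1 answers: 5 respondents (rows) x 6 items (columns)
m <- matrix(c(1, 0, 1, 0, 1, 0,
              0, 1, 0, 0, 0, 0,
              1, 1, 1, 1, 0, 0,
              0, 0, 0, 1, 0, 1,
              1, 0, 1, 0, 1, 0), nrow = 5, byrow = TRUE)

# item numbers of the first three answers, padded with 0 when fewer than three
first3 <- function(v) {
  idx <- which(v == 1)
  c(idx, 0, 0, 0)[1:3]
}

# apply() returns one column per respondent, so transpose to respondents x 3
rdf <- as.data.frame(t(apply(m, 1, first3)))
names(rdf) <- c("r1", "r2", "r3")

# count respondents per distinct triplet, as in step (9) of the script
g <- aggregate(list(n = rep(1, nrow(rdf))), rdf, sum)
g <- g[order(g$n, decreasing = TRUE), ]
print(g)
```

Respondents 1 and 5 gave the same three answers, so their triplet gets n = 2, and the respondent who ticked four items keeps only the first three.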






Monday, July 17, 2017

Playing with microdata II: my first parallel coordinates plot


This post gives the “how to” of my first parallel coordinates plot with ggplot2. It wasn't my very first parallel coordinates plot, because the very first was done with the “parcoord” function from the MASS package. I also tried “parallelplot” from the “lattice” package, as well as “ggparallel” from the “ggparallel” package. You can't get a parallel coordinates plot from the last one. I used it to draw a “parallel sets plot”, which is different, though related to parallel coordinates.

I was playing with the first two of those methods and still couldn't get the attributes of the plots such as headings, labels, and legends right when I stumbled upon “PARALLEL COORDINATE PLOTS FOR DISCRETE AND CATEGORICAL DATA IN R — A COMPARISON” by Markus Conrad in WZB Data Science Blog, from here.

The plot below

was produced by


and it needed a special type of data in “long format”, created with the melt() function from the reshape2 package.
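To make the long format idea concrete, here is a minimal sketch in base R on made-up data. The wide data frame has one row per respondent with columns r1, r2, r3; the long one has one row per respondent and response. The names wide, long, id, rid and col are my own choices for illustration, and melt() is not used.

```r
# made-up wide data: one row per respondent, one column per response
wide <- data.frame(id = 1:3,
                   r1 = c(11, 4, 19),
                   r2 = c(19, 5, 20),
                   r3 = c(20, 11, 21))

# long data: one row per (respondent, response number) pair
long <- data.frame(id  = rep(wide$id, times = 3),
                   rid = rep(1:3, each = nrow(wide)),
                   col = c(wide$r1, wide$r2, wide$r3))
long <- long[order(long$id, long$rid), ]
print(long)
```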

I thought I understood the long format idea. So I tried to do it on my own without using the melt function from the reshape2 package. It worked, though the coding may have been really crude!

#-- myint thann, July 13, 2017
#-- extract data on development priorities from the microdata
#-- (Reference ID: MMR_2014_WBCS_v01_M; Country: Myanmar;
#   Producer:Public Opinion Research Group - The World Bank Group)
#-- and run first parallel coordinate plot with ggplot2

#-- import data
library(foreign)
x <- read.dta("myanmar_cs_fy14_datafile_with_dk_.dta")

# ---- extract and manipulate data
# (1) extract data on development priorities + stakeholder groups
xa2g1r <- x[,c(1,3:37,447)]
# (2) convert factors to integer
xa2g1r[,2:36] <- lapply(xa2g1r[,2:36], as.integer)
# (3) convert NAs to 0 in development priorities
xa2g1r[,2:36][is.na(xa2g1r[,2:36])] <- 0
# (4) convert 2 to 0 in development priorities
xa2g1r[,2:36][xa2g1r[,2:36]==2] <- 0
# (5) remove persons with no response in development priorities
xa2g1r <- xa2g1r[rowSums(xa2g1r[,2:36])>0,]
# (6) create dataframe of necessary variables for analysis
idg1 <- data.frame(row0=as.integer(row.names(xa2g1r)),xa2g1r[,c(1,37)])
row.names(idg1) <- 1:170
idg1$row <- as.integer(row.names(idg1))
# ---- end extract and manipulate data

## ---- create long form dataframe for use with ggplot2
# (7) get row, column ids of responses = 1 
#     create df, add variables for response 1,2,3
w <- which(xa2g1r[1:170,2:36]==1,arr.ind=TRUE)
w <- w[order(w[,1],w[,2]),]
wdf <- data.frame(w)
wdf$r1 <- 0
wdf$r2 <- 0
wdf$r3 <- 0
wdf$rid <- 0
# (8) make rid (response-id)
RF12 <- function (x) {
        col.x <- subset(wdf,row==x)$col
        xrid <- as.integer(as.factor(col.x))
}
wdf$rid <- unlist(apply(as.array(1:170),1,RF12))
# (9) create the list of dataframes for creating long form data for ggplot2
RF3 <- function (x) {
      r.x <- subset(wdf,row==x)
      r.x[,3] <- r.x[r.x$rid == 1,2]
      r.x[,4] <- r.x[r.x$rid == 2,2]
      r.x[,5] <- r.x[r.x$rid == 3,2]
      return(r.x)
}
z <- (apply(as.array(1:170),1,RF3))
# (9b) make one dataframe from the list of 170 data frames
zdf.0 <- do.call("rbind",z)
# (10) remove duplicate rows
zdf <- zdf.0[zdf.0$rid==1,]
# (11) replace NA with 0
zdf[is.na(zdf)] <- 0
# (12) replicate rows to equal number of responses
#      for cases with less than or more than 3 responses
RF4 <- function (x) {
      z.x <- z[[x]]
      m <- nrow(z.x)
      xr <- row.names(z.x[m,])
      t <- ifelse(m==1, 2, 
                  ifelse(m==2, 1, 0))
      z.e <- z.x[rep(xr,t),]
      z.e <- rbind(z.x,z.e)
      return(z.e)
}
ze <- (apply(as.array(c(1:170)),1,RF4))   
# (13) make rid in replicated rows = (1,2,3, ...) and col=0
r123 <- function(x) {
      if (sum(ze[[x]]$rid)==5){
          ze[[x]][3,2] <- 0
          ze[[x]]$rid <- c(1,2,3)} else
      if (sum(ze[[x]]$rid)==3){ 
          ze[[x]][2,2] <- 0
          ze[[x]][3,2] <- 0
          ze[[x]]$rid <- c(1,2,3)} else          
      if (sum(ze[[x]]$rid)==6){} else
          {ze[[x]]$rid <- seq(1,nrow(ze[[x]]),1)}            
      return(ze[[x]])
} 
ze1 <- apply(as.array(1:170),1,r123)  
# (14) make data frame
zedf <- do.call("rbind",ze1) 
# (15) add stakeholder group = g1r to zedf        
pcdf <- merge(idg1,zedf,by="row")
# (16) add stakeholder group code
pcdf$sgcode <- factor(as.integer(pcdf$g1r))        
## ---- end create long form dataframe  for ggplot2

### ---- (17) run first par coord plot
library(ggplot2) 
y_levels <- levels(factor(1:35))
ggplot(pcdf, aes(x = rid, y = col, group = row)) +
       geom_path(aes(size = NULL, color = sgcode),
              alpha = 0.5,
              lineend = 'round', linejoin = 'round')+
       scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
       scale_size(breaks = NULL, range = c(1, 7)) 
### ---- end first par coord plot with ggplot2

In my R script above, steps (1) to (16) produce the data frame pcdf, which looks like this:

From this we need only the variables rid, col, row and sgcode. I just plugged them into the right places in Markus's code and voila!


For the benefit of my fellow dummies (or shall I call them citizen scientists?), I will confide that I struggled for a long time before I got to see a plot, and then the right plot. That was before I realized that the variable for “group =” has to be the one that specifies which observations belong to a given case and are to be connected by lines (for my data frame it was “row”). Hasn't Markus emphasized “# group = id is important!”?
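A tiny example may help here. The sketch below uses made-up long format data with three respondents; the column names id, rid, col and grp are mine. With group = id, ggplot2 draws one path per respondent; without it, the points would not be connected respondent by respondent.

```r
library(ggplot2)

# made-up long data: 3 respondents x 3 responses each
long <- data.frame(id  = rep(1:3, each = 3),
                   rid = rep(1:3, times = 3),
                   col = c(11, 19, 20,  4, 5, 11,  19, 20, 21),
                   grp = factor(rep(c("a", "b", "a"), each = 3)))

# group = id tells geom_path which rows belong to the same line
p <- ggplot(long, aes(x = rid, y = col, group = id)) +
  geom_path(aes(color = grp), alpha = 0.5)
print(p)
```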

The above plot was the prototype from which … I worked through trial and error with the help of various tutorials and question-answers from Cross Validated and Stack Overflow, among others … as I noted in my last post. Thanks to Markus and the other guys, I managed to get the parallel coordinate plots shown there.


Now it's your turn.

Thursday, July 13, 2017

Playing with microdata


A little over a month ago I happened to read the news about the Myanmar Demographic and Health Survey, and by the first week of June I had managed to download the microdata. It wasn't without frustration and struggle, though. Perhaps exaggerating my capabilities and my plan for analysis helped. Then, trying to keep the promise I had made to myself to produce something not so dumb out of microdata downloads, I took the easy way out and compared population pyramids from the DHS and the most recent population results. Such was the theme of my last post.

How about trying my hand at a more refined graphing of complex data?

Looking around I found The World Bank Group Country Survey FY 2014 and promptly downloaded the microdata from the World Bank Microdata portal here. Unlike the microdata for the DHS, you don't need to make a formal request for it. All you need to do is to accept the “Terms and conditions” for the microdata by clicking the “Accept” button on that page, choose the data file format, and start downloading.

The Myanmar Country Opinion Survey is part of the Country Opinion Survey Program series of the World Bank Group. It was designed to achieve the following objectives (Myanmar: The World Bank Group Country Survey FY 2014, Report of Findings, November 2014):

  • “Assist the World Bank Group in gaining a better understanding of how stakeholders in Myanmar perceive the Bank Group;
  • Obtain systematic feedback from stakeholders in Myanmar regarding:
      • Their views regarding the general environment in Myanmar;
      • Their overall attitudes toward the World Bank Group in Myanmar;
      • Overall impressions of the World Bank Group’s operations, knowledge work and activities, and communication and information sharing in Myanmar;
      • Perceptions of the World Bank Group’s future role in Myanmar.
  • Use data to help inform Myanmar country team’s strategy.”

The methodology and scope of the survey were described as:

“Between June and August 2014, 662 stakeholders of the World Bank Group in Myanmar were invited to provide their opinions on the WBG’s work in the country by participating in a country opinion survey. Participants were drawn from the office of the President, Prime Minister; office of a minister; office of a parliamentarian; ministries/ministerial departments; consultants/contractors working on WBG-supported projects/programs; PMUs overseeing implementation of a project; local government officials; bilateral and multilateral agencies; private sector organizations; private foundations; the financial sector/private banks; NGOs; community based organizations; the media; independent government institutions; trade unions; faith-based groups; academia/research institutes/think tanks; judiciary branch; and other organizations. A total of 173 stakeholders participated in the survey (26% response rate).

Respondents received and returned questionnaires through the courier service. Respondents were asked about: general issues facing Myanmar; their overall attitudes toward the WBG; the WBG’s importance and results; the WBG’s knowledge work and activities; the WBG’s future role in Myanmar; and the WBG’s communication and information sharing.”

Quickly going over the contents, I was most interested in knowing what the stakeholders considered

'the top three most important development priorities, which areas the government should focus on, which areas would contribute most to reducing poverty and generating economic growth in Myanmar, and how “shared prosperity” would be best achieved'.

Chapter-IV and the General Issues sections of Appendix-A and Appendix-B reported findings directly relevant to my interest. I wanted to understand how the stakeholders view the development priorities, and the similarities and differences among individuals and groups. For that purpose, I couldn't imagine an effective way to summarize the three responses given by each respondent in tables of data so that they would bring out patterns across the three responses and across individuals/groups. In our case the respondents were to pick three development priorities out of a list of 35, and I guess the most appropriate way to see the patterns in the data is to draw a parallel coordinate plot. As human beings we can't see more than three dimensions, but with parallel coordinates we can see multiple dimensions by representing all of them in just two dimensions!

The following is the famous Fisher's Iris data plotted this way. You can find it under the entry “Parallel Coordinates” in Wikipedia.

In parallel coordinates, the idea basically is to have as many y-axes as the number of dimensions and connect the points on these axes for each individual, case or observation.
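As a rough illustration of that idea, the sketch below draws the iris data this way in base R: each of the four measurements is rescaled to run from 0 to 1 on its own axis, and each flower becomes one line across the four axes. The names iris_num and scaled are mine. The last line calls parcoord from the MASS package (mentioned in my later post), which should give an equivalent picture.

```r
library(MASS)

# the four numeric measurements of Fisher's iris data (built into R)
iris_num <- iris[, 1:4]

# rescale every variable to [0, 1] so that each gets its own comparable axis
scaled <- apply(iris_num, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# one line per flower: x = axis number (1..4), y = scaled value on that axis
matplot(t(scaled), type = "l", lty = 1,
        col = as.integer(iris$Species),
        xaxt = "n", xlab = "", ylab = "scaled value")
axis(1, at = 1:4, labels = colnames(iris_num))

# the same kind of plot with the ready-made function from MASS
parcoord(iris_num, col = as.integer(iris$Species))
```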

In the following plot I used the downloaded World Bank Country Survey data (Reference ID: MMR_2014_WBCS_v01_M; Country: Myanmar; Producer: Public Opinion Research Group - The World Bank Group) to create the parallel coordinates plot for the three Development Priorities picked by each stakeholder. Each line in the plot shows the response of one stakeholder and is colored to show the stakeholder group to which he/she belongs. I used R to process the data and the graphs were plotted with the ggplot2 package.

The first plot is for all nine stakeholder groups. One thing clearly seen from this plot is that one respondent went over the three-response limit and gave a fourth one. While the rest of the respondents gave three responses, one gave only one response and another gave only two.


As we can see, it is not easy to distinguish the different stakeholder groups in the above plot because there are too many of them and the colors become hard to tell apart. To overcome this we could try making a particular group stand out by plotting it against the rest of the groups. The following is the plot of the Media group vs. the rest, made that way.
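One way to set up such a "one group vs. the rest" comparison is to recode the group variable into two levels before plotting. This is only a sketch with made-up group labels; the names grp and highlight are mine and the actual labels in the microdata may differ.

```r
# made-up stakeholder group labels for five respondents
grp <- factor(c("Media", "Academia", "Media", "Government", "NGO"))

# two-level factor: the group of interest vs. everybody else
highlight <- factor(ifelse(grp == "Media", "Media", "Rest"),
                    levels = c("Rest", "Media"))
print(table(highlight))
```

Mapping color to a variable like highlight instead of the full group code leaves only two colors to tell apart.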


So far it looks as if the message from the plot is clear. You can see each respondent's pattern of response as well as the collective pattern for the group. From such plots it is relatively easy to judge whether one group is more homogeneous than another in its responses, or to get an idea of the pattern of response for the most popular development priorities, and so on. But I have the uneasy feeling that opinion surveys may suffer from people being unwilling to speak their minds. That was found to be true, for example, even for exit polls which, I feel, typically ask the least sensitive questions. Another problem is the low response rates.

The World Bank invited 662 stakeholders to participate in their opinion survey. It was a pity that only 173 (26%) responded. Of these, 3 stakeholders didn't give any answer to the Development Priority question (items a2_1 to a2_35), and out of the remaining 170 stakeholders, 22 didn't answer the question on which stakeholder group they belong to. So my plots covered just 148 stakeholders (22%) out of 662.
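The arithmetic behind these figures can be checked in a few lines of R. The variable names are mine; the counts are the ones quoted in the text.

```r
invited   <- 662   # stakeholders invited to the survey
responded <- 173   # stakeholders who took part

# 3 gave no development priority; 22 more gave no stakeholder group
answered_priorities <- responded - 3
plotted <- answered_priorities - 22

# percentage of the invited stakeholders, to one decimal place
pct <- function(k) round(100 * k / invited, 1)
print(c(responded = pct(responded), plotted = pct(plotted)))
```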

At the beginning I felt rather uneasy about the low response percentage (22 or 26%) of this survey. It means users may have to look beyond the survey data to make up their minds on whether the data reflect the opinions of the respective stakeholder groups. Anyway, I am pretty sure that parallel coordinate plots are highly suitable for analyzing high dimensional data. Also, my primary intention was not to interpret and make sense of the analysis results. It was just the modest ambition of sharing my do-it-yourself experience. As this sharing would have my “easier done than said” twist, I went on happily doing my parallel coordinate plots.


All the plots for the World Bank Opinion Survey shown in this post were based on the microdata mentioned earlier. The microdata included the codes for the nine stakeholder groups used in the analyses for the survey report. The numbers of respondents who answered both the question on which stakeholder group they belong to and the question on development priorities were:

The plots would have been more readable if the Development Priority items were shown in full text instead of their codes. However, that would take too much space and leave the actual plot area too small. So we will have to refer to the list below:

I find ggplot2 not easy to learn, but it works great. I don't know much about R graphics, or ggplot2 in particular for that matter. But I guess ggplot2 produces the prettiest complex graphics from your data.

I had zero experience with ggplot2 before I started working on these plots. I worked through trial and error with the help of various tutorials and question-answers from Cross Validated and Stack Overflow, among others. My errors were much more than my trials, as one cartoon character said. But I've finally made it.