Monday, July 17, 2017

Playing with microdata II: my first parallel coordinates plot


This post gives “how to” of my first parallel coordinates plot with ggplot2. It wasn't my very first parallel coordinates plot because the very first was done with “parcoord” function from the MASS package. I also tried “parallelplot” of the “lattice” package, as well as “ggparallel” from the “ggparallel” package. You can't get parallel coordinate plot from the last one. I used it to draw “parallel sets plot” which is different though it is related to the parallel coordinates.

I was playing with the first two of those methods and still couldn't get the attributes of the plots such as headings, labels, and legends right when I stumbled upon “PARALLEL COORDINATE PLOTS FOR DISCRETE AND CATEGORICAL DATA IN R — A COMPARISON” by Markus Conrad in WZB Data Science Blog, from here.

The plot below

was produced by


and it needed a special type of data in “long format” using melt( ) funtion from the reshape2 package.

I thought I understood the long format idea. So I tried to do it on my own without using melt function from the reshape2 package. It worked, though the coding may have been really crude!

#-- myint thann, July 13, 2017
#-- extract data on development priorities from the microdata
#-- (Reference ID: MMR_2014_WBCS_v01_M; Country: Myanmar;
#   Producer:Public Opinion Research Group - The World Bank Group)
#-- and run first parallel coordinate plot with ggplot2

#-- import data
library(foreign)
x <- read.dta("myanmar_cs_fy14_datafile_with_dk_.dta")

# ---- extract and manipulate data
# (1) extract data on development priorities + stakeholder groups
xa2g1r <- x[,c(1,3:37,447)]
# (2) convert factors to integer
xa2g1r[,2:36] <- lapply(xa2g1r[,2:36], as.integer)
# (3) convert NAs to 0 in development priorities
xa2g1r[,2:36][is.na(xa2g1r[,2:36])] <- 0
# (4) convert 2 to 0 in development priorities
xa2g1r[,2:36][xa2g1r[,2:36]==2] <- 0
# (5) remove persons with no response in development priorities
xa2g1r <- xa2g1r[rowSums(xa2g1r[,2:36])>0,]
# (6) create dataframe of necessary variables for analysis
idg1 <- data.frame(row0=as.integer(row.names(xa2g1r)),xa2g1r[,c(1,37)])
row.names(idg1) <- 1:170
idg1$row <- as.integer(row.names(idg1))
# ---- end extract and manipulate data

## ---- create long form dataframe for use with ggplot2
# (7) get row, column ids of responses = 1 
#     create df, add variables for response 1,2,3
w <- which(xa2g1r[1:170,2:36]==1,arr.ind=TRUE)
w <- w[order(w[,1],w[,2]),]
wdf <- data.frame(w)
wdf$r1 <- 0
wdf$r2 <- 0
wdf$r3 <- 0
wdf$rid <- 0
# (8) make rid (response-id)
RF12 <- function (x) {
        col.x <- subset(wdf,row==x)$col
        xrid <- as.integer(as.factor(col.x))
}
wdf$rid <- unlist(apply(as.array(1:170),1,RF12))
# (9) create the list of dataframes for creating long form data for ggplot2
RF3 <- function (x) {
      r.x <- subset(wdf,row==x)
      r.x[,3] <- r.x[r.x$rid == 1,2]
      r.x[,4] <- r.x[r.x$rid == 2,2]
      r.x[,5] <- r.x[r.x$rid == 3,2]
      return(r.x)
}
z <- (apply(as.array(1:170),1,RF3))
# (9) make one dataframe from the list of 170 data frames
zdf.0 <- do.call("rbind",z)
# (10) remove duplicate rows
zdf <- zdf.0[zdf.0$rid==1,]
# (11) replace NA with 0
zdf[is.na(zdf)] <- 0
# (12) replicate rows to equal number of responses
#      for cases with less than or more than 3 responses
RF4 <- function (x) {
      z.x <- z[[x]]
      m <- nrow(z.x)
      xr <- row.names(z.x[m,])
      t <- ifelse(m==1, 2, 
                  ifelse(m==2, 1, 0))
      z.e <- z.x[rep(xr,t),]
      z.e <- rbind(z.x,z.e)
      return(z.e)
}
ze <- (apply(as.array(c(1:170)),1,RF4))   
# (13) make rid in replicated rows = (1,2,3, ...) and col=0
r123 <- function(x) {
      if (sum(ze[[x]]$rid)==5){
          ze[[x]][3,2] <- 0
          ze[[x]]$rid <- c(1,2,3)} else
      if (sum(ze[[x]]$rid)==3){ 
          ze[[x]][2,2] <- 0
          ze[[x]][3,2] <- 0
          ze[[x]]$rid <- c(1,2,3)} else          
      if (sum(ze[[x]]$rid)==6){} else
          {ze[[x]]$rid <- seq(1,nrow(ze[[x]]),1)}            
      return(ze[[x]])
} 
ze1 <- apply(as.array(1:170),1,r123)  
# (14) make data frame
zedf <- do.call("rbind",ze1) 
# (15) add stakeholder group = g1r to zedf        
pcdf <- merge(idg1,zedf,by="row")
# (16) add stakeholder group code
pcdf$sgcode <- factor(as.integer(pcdf$g1r))        
## ---- end create long form dataframe  for ggplot2

### ---- (17) run first par coord plot
library(ggplot2) 
y_levels <- levels(factor(1:35))
ggplot(pcdf, aes(x = rid, y = col, group = row)) +
       geom_path(aes(size = NULL, color = sgcode),
              alpha = 0.5,
              lineend = 'round', linejoin = 'round')+
       scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
       scale_size(breaks = NULL, range = c(1, 7)) 
### ---- end first par coord plot with ggplot2

From my R-script above the steps from (1) to (16) produces the dataframe pcdf which looks like this:

From this we need only the variables rid, col, row and sgcode. I just plugged them in the in the right places in Markus's code and voila!


For the benefit of my fellow dummies (or shall I call them citizen scientists?) I will confide to them that I struggled for a long time before I got to see a plot and then the right plot. That was before I realize that the variable for “group =” has to be the variable that specifies the observations in a given case that are to be connected by lines (for my dataframe it was “row”). Hasn't Markus emphasized “# group = id is important!”?

The above plot was the prototype from which … I worked through trial and error with the help of various tutorials and question-answers from Cross Validated and Stack Overflow, among others … as I noted in my last post. Thanks to Marus and other guys I manged to get the parallel coordinate plots shown there.


Now is your turn.

No comments:

Post a Comment