This post gives “how to” of my
first parallel coordinates plot with ggplot2. It wasn't my very
first parallel coordinates plot because the very first was done
with “parcoord” function from the MASS package. I also tried
“parallelplot” of the “lattice” package, as well as
“ggparallel” from the “ggparallel” package. You can't get
parallel coordinate plot from the last one. I used it to draw
“parallel sets plot” which is different though it is
related to the parallel coordinates.
I was playing with the first two of
those methods and still couldn't get the attributes of the plots such
as headings, labels, and legends right when I stumbled upon “PARALLEL
COORDINATE PLOTS FOR DISCRETE AND CATEGORICAL DATA IN R — A
COMPARISON” by Markus Conrad in WZB Data Science Blog, from here.
The plot below
was produced by
and it needed a special type of data in
“long format” using melt( ) funtion from the reshape2
package.
I thought I understood the long format
idea. So I tried to do it on my own without using melt
function from the reshape2 package. It worked, though the
coding may have been really crude!
#-- myint thann, July 13, 2017
#-- extract data on development priorities from the microdata
#-- (Reference ID: MMR_2014_WBCS_v01_M; Country: Myanmar;
# Producer:Public Opinion Research Group - The World Bank Group)
#-- and run first parallel coordinate plot with ggplot2
#-- import data
library(foreign)
x <- read.dta("myanmar_cs_fy14_datafile_with_dk_.dta")
# ---- extract and manipulate data
# (1) extract data on development priorities + stakeholder groups
xa2g1r <- x[,c(1,3:37,447)]
# (2) convert factors to integer
xa2g1r[,2:36] <- lapply(xa2g1r[,2:36], as.integer)
# (3) convert NAs to 0 in development priorities
xa2g1r[,2:36][is.na(xa2g1r[,2:36])] <- 0
# (4) convert 2 to 0 in development priorities
xa2g1r[,2:36][xa2g1r[,2:36]==2] <- 0
# (5) remove persons with no response in development priorities
xa2g1r <- xa2g1r[rowSums(xa2g1r[,2:36])>0,]
# (6) create dataframe of necessary variables for analysis
idg1 <- data.frame(row0=as.integer(row.names(xa2g1r)),xa2g1r[,c(1,37)])
row.names(idg1) <- 1:170
idg1$row <- as.integer(row.names(idg1))
# ---- end extract and manipulate data
## ---- create long form dataframe for use with ggplot2
# (7) get row, column ids of responses = 1
# create df, add variables for response 1,2,3
w <- which(xa2g1r[1:170,2:36]==1,arr.ind=TRUE)
w <- w[order(w[,1],w[,2]),]
wdf <- data.frame(w)
wdf$r1 <- 0
wdf$r2 <- 0
wdf$r3 <- 0
wdf$rid <- 0
# (8) make rid (response-id)
RF12 <- function (x) {
col.x <- subset(wdf,row==x)$col
xrid <- as.integer(as.factor(col.x))
}
wdf$rid <- unlist(apply(as.array(1:170),1,RF12))
# (9) create the list of dataframes for creating long form data for ggplot2
RF3 <- function (x) {
r.x <- subset(wdf,row==x)
r.x[,3] <- r.x[r.x$rid == 1,2]
r.x[,4] <- r.x[r.x$rid == 2,2]
r.x[,5] <- r.x[r.x$rid == 3,2]
return(r.x)
}
z <- (apply(as.array(1:170),1,RF3))
# (9) make one dataframe from the list of 170 data frames
zdf.0 <- do.call("rbind",z)
# (10) remove duplicate rows
zdf <- zdf.0[zdf.0$rid==1,]
# (11) replace NA with 0
zdf[is.na(zdf)] <- 0
# (12) replicate rows to equal number of responses
# for cases with less than or more than 3 responses
RF4 <- function (x) {
z.x <- z[[x]]
m <- nrow(z.x)
xr <- row.names(z.x[m,])
t <- ifelse(m==1, 2,
ifelse(m==2, 1, 0))
z.e <- z.x[rep(xr,t),]
z.e <- rbind(z.x,z.e)
return(z.e)
}
ze <- (apply(as.array(c(1:170)),1,RF4))
# (13) make rid in replicated rows = (1,2,3, ...) and col=0
r123 <- function(x) {
if (sum(ze[[x]]$rid)==5){
ze[[x]][3,2] <- 0
ze[[x]]$rid <- c(1,2,3)} else
if (sum(ze[[x]]$rid)==3){
ze[[x]][2,2] <- 0
ze[[x]][3,2] <- 0
ze[[x]]$rid <- c(1,2,3)} else
if (sum(ze[[x]]$rid)==6){} else
{ze[[x]]$rid <- seq(1,nrow(ze[[x]]),1)}
return(ze[[x]])
}
ze1 <- apply(as.array(1:170),1,r123)
# (14) make data frame
zedf <- do.call("rbind",ze1)
# (15) add stakeholder group = g1r to zedf
pcdf <- merge(idg1,zedf,by="row")
# (16) add stakeholder group code
pcdf$sgcode <- factor(as.integer(pcdf$g1r))
## ---- end create long form dataframe for ggplot2
### ---- (17) run first par coord plot
library(ggplot2)
y_levels <- levels(factor(1:35))
ggplot(pcdf, aes(x = rid, y = col, group = row)) +
geom_path(aes(size = NULL, color = sgcode),
alpha = 0.5,
lineend = 'round', linejoin = 'round')+
scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
scale_size(breaks = NULL, range = c(1, 7))
### ---- end first par coord plot with ggplot2
From my R-script above the steps from
(1) to (16) produces the dataframe pcdf which looks like this:
From this we need only the variables
rid, col, row and sgcode. I
just plugged them in the in the right places in Markus's code and
voila!
For the benefit of my fellow dummies
(or shall I call them citizen scientists?) I will confide to them
that I struggled for a long time before I got to see a plot and then
the right plot. That was before I realize that the variable for
“group =” has to be the variable that specifies the observations
in a given case that are to be connected by lines (for my dataframe
it was “row”). Hasn't Markus emphasized “
#
group = id is important!
”?
The above plot was the prototype from
which … I worked through trial and error with the help of
various tutorials and question-answers from Cross Validated and Stack
Overflow, among others … as I noted in my last post. Thanks to
Marus and other guys I manged to get the parallel coordinate plots
shown there.
Now is your turn.
No comments:
Post a Comment