Wednesday, December 6, 2017

R Notebook version of my parallel coordinates plots

This is the R Markdown Notebook version of the parallel coordinates plots shown in my post “Playing with microdata”. When you execute code within the notebook, the results appear beneath the code.
For the following code to run, you need to have (i) downloaded the Stata data file “myanmarcs_fy14_datafile_with_dk.dta” from the World Bank site, (ii) run the code in my previous post “Playing with microdata II: my first parallel coordinates plot” through step (16) and saved the resulting data frame to “pcdf.RData”, and (iii) placed that file in the directory of the R Notebook project in RStudio.
load("pcdf.RData")
str(pcdf)
## 'data.frame':    511 obs. of  10 variables:
##  $ row   : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ row0  : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ id    : int  101 101 101 102 102 102 103 103 103 104 ...
##  $ g1r   : Factor w/ 9 levels "Office of the President/ Prime Minster/ Minister",..: NA NA NA 5 5 5 1 1 1 5 ...
##  $ col   : num  18 19 20 7 13 20 7 8 0 7 ...
##  $ r1    : int  18 18 18 7 7 7 7 7 7 7 ...
##  $ r2    : int  19 19 19 13 13 13 8 8 8 13 ...
##  $ r3    : int  20 20 20 20 20 20 NA NA NA 20 ...
##  $ rid   : num  1 2 3 1 2 3 1 2 3 1 ...
##  $ sgcode: Factor w/ 9 levels "1","2","3","4",..: NA NA NA 5 5 5 1 1 1 5 ...
head(pcdf)
##   row row0  id                                            g1r col r1 r2 r3
## 1   1    1 101                                           <NA>  18 18 19 20
## 2   1    1 101                                           <NA>  19 18 19 20
## 3   1    1 101                                           <NA>  20 18 19 20
## 4   2    2 102 Private Sector/ Financial Sector/ Private Bank   7  7 13 20
## 5   2    2 102 Private Sector/ Financial Sector/ Private Bank  13  7 13 20
## 6   2    2 102 Private Sector/ Financial Sector/ Private Bank  20  7 13 20
##   rid sgcode
## 1   1   <NA>
## 2   2   <NA>
## 3   3   <NA>
## 4   1      5
## 5   2      5
## 6   3      5
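The save-then-load step from the prerequisites can be sketched as follows. This is only an illustration on a toy data frame: the names toy and rdata_path are made up here, and a temporary directory stands in for the RStudio project directory.

```r
# Toy stand-in for the real pcdf data frame (hypothetical values)
toy <- data.frame(id = c(101, 102), col = c(18, 7))

# Save the object to an .RData file; tempdir() stands in for the project directory
rdata_path <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_path)

# Remove the object, then restore it from the file
rm(toy)
loaded_names <- load(rdata_path)  # load() returns the names of the restored objects

str(toy)
```

Because load() restores objects under the names they were saved with, the data frame comes back as toy here, just as “pcdf.RData” brings back pcdf.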
In my post “Playing with microdata II: my first parallel coordinates plot”, the plot at the bottom of the post was created by running the following code chunk (note that to get the desired plot size, we use chunk options such as “{r fig.height=6, fig.width=6}” in the header of a given code chunk):
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
y_levels <- levels(factor(1:35))
ggplot(pcdf, aes(x = rid, y = col, group = row)) +
geom_path(aes(size = NULL, color = sgcode),
alpha = 0.5,
lineend = 'round', linejoin = 'round')+
scale_y_discrete(limits = y_levels, expand = c(0.5, 0)) +
scale_size(breaks = NULL, range=c(1,7))
[Figure: parallel coordinates plot of the three responses, with lines colored by sgcode]
Then I left it to the reader to try creating plots like the one shown in the last part of my post “Playing with microdata”. In fact, that last graphic consisted of six separate plots, which I couldn't find an easy way to combine on one page using ggplot2. To dodge this issue I just combined them into a single graphic using GIMP! I think this is fine for the time being, because my primary purpose is blogging. But for sharing my R code, I should learn to place multiple plots produced by ggplot2 on one page. One promising way would be to use the multiplot function given in: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/, and there may be others.
The main idea for improving the above plot is to select one stakeholder group and plot the lines of this group in one color in front of all other groups in another color. To do so, we (i) create Y-axis labels to represent the codes for development priorities, (ii) define the order in which the two groups of lines are drawn, (iii) create legends for the line colors with text wrapping, for readability of the plots, and (iv) add appropriate axis labels and headings.
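One of those other ways might be the gridExtra package, sketched below. Note that this is a different technique from the multiplot function in the Cookbook link, and it assumes gridExtra is installed alongside ggplot2. The toy data and the names p1, p2, and combined are placeholders, not objects from this notebook.

```r
library(ggplot2)

# Toy data standing in for the survey responses (hypothetical values)
toy <- data.frame(rid = rep(1:3, times = 2),
                  col = c(1, 4, 2, 3, 1, 5),
                  row = rep(1:2, each = 3))

# Two separate ggplot objects
p1 <- ggplot(toy, aes(x = rid, y = col, group = row)) +
  geom_path() + labs(subtitle = "Plot 1")
p2 <- ggplot(toy, aes(x = rid, y = col, group = row)) +
  geom_path(color = "blue") + labs(subtitle = "Plot 2")

# Arrange both plots side by side on one page, then draw the page
combined <- gridExtra::arrangeGrob(p1, p2, ncol = 2)
grid::grid.draw(combined)
```

For six plots, something like ncol = 3 would give a two-by-three layout, though the legends at the bottom of each plot may need smaller text to fit.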

Create Y-axis labels

# create data frame of response codes to use with ggplot
YL <- data.frame(A=as.character(rep("a2_",35)), n=as.character(seq(1,35,1)))
q <- paste(YL$A,YL$n,sep="")
YL$rcode <- factor(q,levels=q)
head(YL)
##     A n rcode
## 1 a2_ 1  a2_1
## 2 a2_ 2  a2_2
## 3 a2_ 3  a2_3
## 4 a2_ 4  a2_4
## 5 a2_ 5  a2_5
## 6 a2_ 6  a2_6
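As a cross-check, the same 35 response codes can be built more compactly with paste0(). This is a minimal sketch, not the code used for the plots; the names codes and rcode_alt are introduced only for this illustration.

```r
# Build the 35 response codes "a2_1" ... "a2_35" in one step
codes <- paste0("a2_", 1:35)

# Pass the codes as explicit levels so the factor keeps numeric order;
# without levels =, factor() sorts alphabetically, which typically puts
# "a2_10" ahead of "a2_2"
rcode_alt <- factor(codes, levels = codes)

head(levels(rcode_alt))
```

Keeping the levels in numeric order matters because these levels are later passed to scale_y_discrete(limits = ...), which fixes the order of the Y-axis.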

Create stakeholder group names with text wrapping

pcdf$SGname <- gsub("/","/ \n",pcdf$g1r)
sgn <- gsub("/","/ \n", levels(pcdf$g1r))
pcdf$SGname <- factor(pcdf$SGname, levels= sgn)
head(pcdf$SGname)
## [1] <NA>                                                
## [2] <NA>                                                
## [3] <NA>                                                
## [4] Private Sector/ \n Financial Sector/ \n Private Bank
## [5] Private Sector/ \n Financial Sector/ \n Private Bank
## [6] Private Sector/ \n Financial Sector/ \n Private Bank
## 9 Levels: Office of the President/ \n Prime Minster/ \n Minister ...
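To see what the gsub() call does on its own, here is a small sketch on a toy character vector. The vector sg_labels is hypothetical; only its first entry is copied from the output above. The argument fixed = TRUE is an addition here that treats "/" as a literal string rather than a regular expression.

```r
# Toy stand-in for the stakeholder group names (third entry is a missing value)
sg_labels <- c("Private Sector/ Financial Sector/ Private Bank", "Media", NA)

# Insert a line break after every slash; labels without a slash and
# missing values pass through unchanged
wrapped <- gsub("/", "/ \n", sg_labels, fixed = TRUE)

# cat() shows the label as it would be laid out in a legend
cat(wrapped[1], "\n")
```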

Create new variables in which one stakeholder group name is preserved and other groups are collapsed into “All Others”

pcdf$SG_1 <- ifelse(as.integer(pcdf$g1r) == 1,
    levels(pcdf$SGname)[1], "All Others")
pcdf$SG_2 <- ifelse(as.integer(pcdf$g1r) == 2,
     levels(pcdf$SGname)[2], "All Others")
pcdf$SG_3 <- ifelse(as.integer(pcdf$g1r) == 3,
    levels(pcdf$SGname)[3],
    "All Others")
pcdf$SG_5 <- ifelse(as.integer(pcdf$g1r) == 5,
    levels(pcdf$SGname)[5],
    "All Others")
pcdf$SG_6 <- ifelse(as.integer(pcdf$g1r) == 6,
    levels(pcdf$SGname)[6], "All Others")
pcdf$SG_7 <- ifelse(as.integer(pcdf$g1r) == 7,
    levels(pcdf$SGname)[7], "All Others")
head(pcdf)
##   row row0  id                                            g1r col r1 r2 r3
## 1   1    1 101                                           <NA>  18 18 19 20
## 2   1    1 101                                           <NA>  19 18 19 20
## 3   1    1 101                                           <NA>  20 18 19 20
## 4   2    2 102 Private Sector/ Financial Sector/ Private Bank   7  7 13 20
## 5   2    2 102 Private Sector/ Financial Sector/ Private Bank  13  7 13 20
## 6   2    2 102 Private Sector/ Financial Sector/ Private Bank  20  7 13 20
##   rid sgcode                                               SGname
## 1   1   <NA>                                                 <NA>
## 2   2   <NA>                                                 <NA>
## 3   3   <NA>                                                 <NA>
## 4   1      5 Private Sector/ \n Financial Sector/ \n Private Bank
## 5   2      5 Private Sector/ \n Financial Sector/ \n Private Bank
## 6   3      5 Private Sector/ \n Financial Sector/ \n Private Bank
##         SG_1       SG_2       SG_3
## 1       <NA>       <NA>       <NA>
## 2       <NA>       <NA>       <NA>
## 3       <NA>       <NA>       <NA>
## 4 All Others All Others All Others
## 5 All Others All Others All Others
## 6 All Others All Others All Others
##                                                   SG_5       SG_6
## 1                                                 <NA>       <NA>
## 2                                                 <NA>       <NA>
## 3                                                 <NA>       <NA>
## 4 Private Sector/ \n Financial Sector/ \n Private Bank All Others
## 5 Private Sector/ \n Financial Sector/ \n Private Bank All Others
## 6 Private Sector/ \n Financial Sector/ \n Private Bank All Others
##         SG_7
## 1       <NA>
## 2       <NA>
## 3       <NA>
## 4 All Others
## 5 All Others
## 6 All Others
# convert to factors
pcdf[,12:17] <- lapply(pcdf[,12:17], as.factor)
# change levels of factors
pcdf[,12] <- relevel(pcdf[,12], ref=levels(pcdf[,12])[2])
pcdf[,13] <- relevel(pcdf[,13], ref=levels(pcdf[,13])[2])
pcdf[,14] <- relevel(pcdf[,14], ref=levels(pcdf[,14])[2])
pcdf[,15] <- relevel(pcdf[,15], ref=levels(pcdf[,15])[2])
pcdf[,16] <- relevel(pcdf[,16], ref=levels(pcdf[,16])[2])
pcdf[,17] <- relevel(pcdf[,17], ref=levels(pcdf[,17])[2])
str(pcdf)
## 'data.frame':    511 obs. of  17 variables:
##  $ row   : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ row0  : int  1 1 1 2 2 2 3 3 3 4 ...
##  $ id    : int  101 101 101 102 102 102 103 103 103 104 ...
##  $ g1r   : Factor w/ 9 levels "Office of the President/ Prime Minster/ Minister",..: NA NA NA 5 5 5 1 1 1 5 ...
##  $ col   : num  18 19 20 7 13 20 7 8 0 7 ...
##  $ r1    : int  18 18 18 7 7 7 7 7 7 7 ...
##  $ r2    : int  19 19 19 13 13 13 8 8 8 13 ...
##  $ r3    : int  20 20 20 20 20 20 NA NA NA 20 ...
##  $ rid   : num  1 2 3 1 2 3 1 2 3 1 ...
##  $ sgcode: Factor w/ 9 levels "1","2","3","4",..: NA NA NA 5 5 5 1 1 1 5 ...
##  $ SGname: Factor w/ 9 levels "Office of the President/ \n Prime Minster/ \n Minister",..: NA NA NA 5 5 5 1 1 1 5 ...
##  $ SG_1  : Factor w/ 2 levels "Office of the President/ \n Prime Minster/ \n Minister",..: NA NA NA 2 2 2 1 1 1 2 ...
##  $ SG_2  : Factor w/ 2 levels "Office of Parliamentarian",..: NA NA NA 2 2 2 2 2 2 2 ...
##  $ SG_3  : Factor w/ 2 levels "Employee of a Ministry/ \n PMU/ \n Consultant on WBG project",..: NA NA NA 2 2 2 2 2 2 2 ...
##  $ SG_5  : Factor w/ 2 levels "Private Sector/ \n Financial Sector/ \n Private Bank",..: NA NA NA 1 1 1 2 2 2 1 ...
##  $ SG_6  : Factor w/ 2 levels "CSO","All Others": NA NA NA 2 2 2 2 2 2 2 ...
##  $ SG_7  : Factor w/ 2 levels "Media","All Others": NA NA NA 2 2 2 2 2 2 2 ...
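The six near-identical ifelse() calls and the later relevel() calls could also be folded into one small helper. The sketch below uses toy data; collapse_others is a hypothetical function written for this illustration, not something from a package.

```r
# Toy stand-in for pcdf$SGname (hypothetical data, not the survey data)
toy <- data.frame(
  row    = 1:5,
  SGname = factor(c("A", "B", NA, "C", "A"), levels = c("A", "B", "C"))
)

# Hypothetical helper: keep the k-th level and collapse all other levels into
# "All Others". Setting levels explicitly puts the kept group first, which is
# what the relevel() calls above achieve.
collapse_others <- function(f, k) {
  keep <- levels(f)[k]
  out  <- ifelse(as.integer(f) == k, keep, "All Others")
  factor(out, levels = c(keep, "All Others"))
}

# Create one two-level factor per group of interest
for (k in c(1, 3)) {
  toy[[paste0("SG_", k)]] <- collapse_others(toy$SGname, k)
}

str(toy)
```

Rows where the group is missing stay NA in the new columns, matching the behavior seen in the output above.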

Plot by all stakeholder groups

# plot responses for development priorities for all stakeholder groups
y_levels <- levels(YL$rcode)
ggplot(pcdf, aes(x = rid, 
        y = col, group=row))+ 
        labs(title = "General Issues Facing Myanmar:", 
            subtitle = "Development Priority")+
        xlab("Three responses")+
        ylab("Response code")+
        geom_path(aes(color = SGname), lineend='round',
            linejoin='round', size=0)+
            scale_y_discrete(limits = y_levels)+
            scale_size(breaks = NULL, range = c(1, 35))
[Figure: parallel coordinates plot of development priority responses for all stakeholder groups]

Define the order of drawing the groups of lines

The idea is from “ggplot2: Determining the order in which lines are drawn”: http://blog.mckuhn.de/2011/08/ggplot2-determining-order-in-which.html.
pcdf$o1 <- as.factor(apply(format(pcdf[,c("SG_1", "row")]), 1, paste, collapse=" "))
pcdf$o2 <- as.factor(apply(format(pcdf[,c("SG_2", "row")]), 1, paste, collapse=" "))
pcdf$o3 <- as.factor(apply(format(pcdf[,c("SG_3", "row")]), 1, paste, collapse=" "))
pcdf$o5 <- as.factor(apply(format(pcdf[,c("SG_5", "row")]), 1, paste, collapse=" "))
pcdf$o6 <- as.factor(apply(format(pcdf[,c("SG_6", "row")]), 1, paste, collapse=" "))
pcdf$o7 <- as.factor(apply(format(pcdf[,c("SG_7", "row")]), 1, paste, collapse=" "))
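A toy example may make the trick easier to follow. As I understand the linked post, ggplot2 draws the groups in the order of the levels of the grouping factor, and as.factor() sorts those levels alphabetically, so every "All Others" line comes before the lines of the highlighted group and ends up behind them. The data frame toy below is hypothetical.

```r
# Toy stand-in with one highlighted group ("Media") and "All Others"
toy <- data.frame(
  SG  = factor(c("Media", "All Others", "Media", "All Others"),
               levels = c("Media", "All Others")),
  row = c(1, 2, 3, 4)
)

# Same construction as above: paste the group name and the row id together
toy$o <- as.factor(apply(format(toy[, c("SG", "row")]), 1, paste, collapse = " "))

# The sorted levels start with the "All Others" rows, so those lines are
# drawn first and the "Media" lines are drawn on top of them
levels(toy$o)
```

This relies on "All Others" sorting alphabetically before the name of the highlighted group, which holds for the group names used in this notebook.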
The previous plot shows that some respondents didn't identify the stakeholder group (SG) they belong to, so rows in which pcdf$g1r is NA have to be dropped from the data frame.
pcdf.1 <- pcdf[!is.na(pcdf$g1r),]
nrow(pcdf.1)
## [1] 445

Create plots with emphasis on a particular stakeholder group

Two plots were drawn, for stakeholder groups SG_3 and SG_7, with headings, axis labels, thicker lines, blue for the front lines, and yellow for the background lines. The colors are set using “scale_colour_manual(values=c('blue','yellow'))”. I was happy playing with many combinations of two different colors before I settled on blue and yellow.
You can find out what a staggering number of colors are available by running “colors()”. To understand how ggplot2 assigns colors to factor levels, Q&As such as these may be useful: https://stackoverflow.com/questions/46393082/ggplot2-why-is-color-order-of-geom-line-graphs-reversed, and https://stackoverflow.com/questions/9887342/ggplot2-plotting-order-of-factors-within-a-geom.
The legend is placed at the bottom of the plot. Plots for the remaining groups can be drawn by changing group=, the subtitle, and color= appropriately.
# for SG_3
plot <- ggplot((pcdf.1), aes(x = rid, 
        y = col, group=o3))+ 
        labs(title = "General Issues Facing Myanmar: \n Development Priority", 
            subtitle = "(Group-3 vs. All Others)")+
        xlab("Three responses")+
        ylab("Response code")+
        geom_path(aes(color = SG_3), lineend='round',
            linejoin='round', size=.8)+
            scale_colour_manual(values=c('blue','yellow'))+
            scale_y_discrete(limits = y_levels)+
            xlim("First","Second", "Third", "(Fourth)")+
            scale_size(breaks = NULL)
# add the legend styling and print the plot
plot + theme(legend.text = element_text(size = 8, hjust = .1, vjust = .1),
             legend.position = "bottom")
[Figure: parallel coordinates plot, Group-3 (blue) vs. All Others (yellow)]
# for SG_7
plot <- ggplot((pcdf.1), aes(x = rid, 
        y = col, group=o7))+ 
        labs(title = "General Issues Facing Myanmar: \n Development Priority", 
            subtitle = "(Stakeholder Group-7 vs. All Others)")+
        xlab("Three responses")+
        ylab("Response code")+
        geom_path(aes(color = SG_7), lineend='round',
            linejoin='round', size=.8)+
            scale_colour_manual(values=c('blue','yellow'))+
            scale_y_discrete(limits = y_levels)+
            xlim("First","Second", "Third", "(Fourth)")+
        scale_size(breaks = NULL)
# add the legend styling and print the plot
plot + theme(legend.text = element_text(size = 8, hjust = .1, vjust = .1),
             legend.position = "bottom")
[Figure: parallel coordinates plot, Stakeholder Group-7 (blue) vs. All Others (yellow)]
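Since the plots for the other groups differ only in group=, the subtitle, and color=, the repeated code could be wrapped in a function. The following is a sketch under assumptions, not the code used for the plots above: plot_sg is a hypothetical helper, the toy data frame stands in for pcdf.1, and the .data pronoun requires a reasonably recent ggplot2 (3.0.0 or later, as far as I know). The discrete axis scales are left out to keep the sketch short.

```r
library(ggplot2)

# Toy stand-in for pcdf.1 (hypothetical values, not the survey data)
toy <- data.frame(
  row  = rep(1:4, each = 3),
  rid  = rep(1:3, times = 4),
  col  = c(1, 5, 9, 2, 5, 7, 3, 4, 9, 1, 6, 8),
  SG_7 = factor(rep(c("Media", "All Others", "All Others", "Media"), each = 3),
                levels = c("Media", "All Others"))
)
# Drawing-order factor: group name plus row id, as in the o7 column above
toy$o7 <- as.factor(paste(format(toy$SG_7), toy$row))

# Hypothetical helper: sg_col and order_col are column names given as strings
plot_sg <- function(df, sg_col, order_col, subtitle) {
  ggplot(df, aes(x = rid, y = col, group = .data[[order_col]])) +
    labs(title = "General Issues Facing Myanmar: \n Development Priority",
         subtitle = subtitle) +
    xlab("Three responses") +
    ylab("Response code") +
    # size = follows the code above; newer ggplot2 versions prefer linewidth =
    geom_path(aes(color = .data[[sg_col]]), lineend = "round",
              linejoin = "round", size = 0.8) +
    scale_colour_manual(values = c("blue", "yellow")) +
    theme(legend.text = element_text(size = 8),
          legend.position = "bottom")
}

p7 <- plot_sg(toy, "SG_7", "o7", "(Stakeholder Group-7 vs. All Others)")
p7
```

With the real data, a call such as plot_sg(pcdf.1, "SG_5", "o5", "(Stakeholder Group-5 vs. All Others)") would then give the plot for another group.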