These sample parallel sets plots are colorful, crisp images with clear messages to match. The favorite theme for illustration seems to be the casualties of the Titanic disaster, and as far as I can see, parallel sets spare you the effort of forming a vivid mental image of the tragic true story. The sample is from Kosara_BeautifulVis_2010.pdf.
My own attempt at visualizing development priorities by stakeholder group, the theme of my last two posts, this time using parallel sets, wasn't so clear:
This may be because every stakeholder respondent could choose up to three development priorities out of the thirty-five given (35 + 595 + 6,545 = 7,175 possible sets of one, two, or three priorities), making the result too complex for clear visualization. Justin, in Games of Thrones Parallel Sets Data Visualization, gives his rule of thumb:
What can we do with our complex data? Maybe we could follow the lead of Markus (from my last post):
His idea is to simplify the data by keeping only the top 10 distinct three-response sets by number of occurrences. For this, we first need to convert the data frame of individual respondents into a grouped data frame of distinct response triplets (r1-r2-r3) and their frequencies (n), like this:
    r1 r2 r3 n
42  11 19 20 5
65  19 20 21 5
9    4  5 11 3
61  11 20 21 3
121 18 20 34 3
10   3  9 11 2
23   4  5 20 2
26   3  9 20 2
28   2 13 20 2
33   4 14 20 2
This resulted in the parallel sets plot:
So far so good. But the problem is that the top 10 response sets cover just 29 respondents out of 170 (17%), and the biggest n is only 5! Also, going further down the rows of the data frame is useless because we would soon get down to frequencies of 1.
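The coverage figure can be checked directly from the frequency column of the table above. This is a minimal sketch: the vector `n_top10` is typed in by hand from that table, and `total_respondents` is the 170 respondents mentioned in the text (both names are mine and do not appear in the script).

```r
# Frequencies (n) of the ten most common response triplets,
# copied by hand from the grouped data frame shown above
n_top10 <- c(5, 5, 3, 3, 3, 2, 2, 2, 2, 2)

# Number of respondents with at least one response (from the text)
total_respondents <- 170

# Respondents covered by the top ten triplets, and their share
covered <- sum(n_top10)
share <- covered / total_respondents

# Cumulative share covered as more triplets are added, one at a time
cum_share <- cumsum(n_top10) / total_respondents

covered                    # 29
round(100 * share)         # 17
round(100 * cum_share, 1)  # cumulative coverage in percent
```

The cumulative shares show how slowly coverage grows: each additional triplet adds at most 5 of the 170 respondents.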
This is the script for producing my two parallel sets plots:
# parallel sets plot using ggparallel package
# myint thann, July 28, 2017
# import data
library(foreign)
x <- read.dta("myanmar_cs_fy14_datafile_with_dk_.dta")
# ---- extract and manipulate data
# (1) extract data on development priorities + stakeholder groups
xa2g1r <- x[,c(1,3:37,447)]
# (2) convert factors to integer
xa2g1r[,2:36] <- lapply(xa2g1r[,2:36], as.integer)
# (3) convert NAs to 0 in development priorities
xa2g1r[,2:36][is.na(xa2g1r[,2:36])] <- 0
# (4) convert 2 to 0 in development priorities
xa2g1r[,2:36][xa2g1r[,2:36]==2] <- 0
# (5) remove persons with no response in development priorities
xa2g1r <- xa2g1r[rowSums(xa2g1r[,2:36])>0,]
# ---- end extract and manipulate data
# (6) create response data
xa2g1r$r1 <- 0
xa2g1r$r2 <- 0
xa2g1r$r3 <- 0
# (7) create first, second, and third responses
for (j in seq_len(nrow(xa2g1r))) {
  k <- 38  # column index of r1; r2 and r3 are columns 39 and 40
  for (i in 2:36) {
    if (xa2g1r[j, i] == 1) {
      xa2g1r[j, k] <- i
      k <- k + 1
    }
    if (k > 40) break
  }
}
### run parallel sets plot
# (8) run parallel sets plot for all responses
library(ggplot2)  # provides coord_flip() and ggsave()
library(ggparallel)
ggparallel(vars = list("r1", "r2", "r3"), xa2g1r, method = "angle") +
  coord_flip()
# save plot
ggsave("perset_all.png", width = 5, height = 5)
# run parallel sets plot for top ten most frequent responses
# (9) create grouped data frame with number of respondents
# for each distinct set of three responses
xa2g1r_G <- xa2g1r[,38:40]
xa2g1r_G <- aggregate(list(n=rep(1,nrow(xa2g1r_G))), xa2g1r_G, length)
xa2g1r_G10 <- xa2g1r_G[order(xa2g1r_G$n, decreasing=TRUE),][1:10,]
# (10) run parallel sets plot for top ten responses
ggparallel(list('r1', 'r2', 'r3'), xa2g1r_G10, weight = 'n', order = 0)
# save plot
ggsave("perset_top10.png", width = 5, height = 5)
Notes:
- In step (7) of my script above, I used loops to create the r1, r2, and r3 columns of a regular data frame. When I tried to avoid loops, I needed 10 steps to arrive at the equivalent data frame used in my last two posts, poor me. You could certainly do better.
- Markus's code for producing the grouped data frame used the dplyr package:
I used one of the three answers to "Find how many times duplicated rows repeat in R data frame" on stackoverflow, by thelatemail, that uses the aggregate function (see step (9) in my R script above). Kudos to the versatility of R and the power of Q/A sites.
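On the first note, steps (6) and (7) can be written without explicit loops. What follows is only a sketch on a small made-up 0/1 matrix, not the survey data. The names `toy`, `first_three`, and `resp` are hypothetical, and the offset of 1 mimics the script, where the first priority sits in column 2 of the data frame. The last lines repeat the `aggregate` call from step (9) on the result.

```r
# Made-up 0/1 indicator data: 5 respondents, 6 priorities
# (1 = priority chosen). This is NOT the survey data.
toy <- matrix(c(1, 0, 1, 0, 1, 0,
                1, 0, 1, 0, 1, 0,
                0, 1, 0, 0, 0, 0,
                0, 0, 1, 1, 0, 1,
                1, 1, 1, 1, 0, 0),
              nrow = 5, byrow = TRUE)

# Return the positions of the first three 1s in a row,
# padded with 0 when fewer than three were chosen.
# `offset` shifts positions to match the column numbers of a
# data frame whose first priority is in column 2.
first_three <- function(row, offset = 1) {
  idx <- which(row == 1) + offset
  c(idx, 0, 0, 0)[1:3]
}

# apply() over rows returns a 3 x n matrix, so transpose it
resp <- as.data.frame(t(apply(toy, 1, first_three)))
names(resp) <- c("r1", "r2", "r3")

# Count distinct triplets, as in step (9) of the script,
# and sort by frequency
resp_G <- aggregate(list(n = rep(1, nrow(resp))), resp, length)
resp_G <- resp_G[order(resp_G$n, decreasing = TRUE), ]
resp_G
```

In this toy data the first two respondents share the triplet 2-4-6, so that row comes out on top with n = 2. Padding with 0 keeps respondents who chose fewer than three priorities, just as the loop version leaves their r2 or r3 at 0.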