Tuesday, June 13, 2017

Playing with published data


The easiest way to try to make some sense out of data published in reports may be to make graphs and/or sort the data to see if there is some recognizable pattern. It is normal to see almost all statistical reports included graphs to try to communicate and highlight their notable results. Arranging the data in a particular theme in ascending or descending order is less common but some extensive statistical reports like a census report produce tables of data with some variables of interest sorted by provinces or some other administrative regions with the region with highest score at the top and that with lowest score at the bottom.

Graphing from data and sorting data from data in some electronic format could both be done quite well with Excel or some other spreadsheet. But only if you know how to get them into your spreadsheet first. The easiest way to do it would be when you could download the data in what is called a data file with “csv” extension. A Comma-Separted-Value (csv) data file can be opened with a spreadsheet directly and then saved as an Excel file or other spreadsheet file if you wish. Now smarter people will use what is known as an “api” or application programming interface provided by the data provider to get data from the Internet.

For me playing with published data should be as natural as children playing with sand on the beach. Here in Myanmar it used to be that not all data of public interest collected from people were published, or if published they were not always publicly available. In one instance, I was even told by a government official that I couldn't cite data from their published report. It was fifteen years ago, though, and I was officially collecting some indicators for the ASEAN Secretariat. If knowledge is power, it certainly wasn't people's power, I thought. Hopefully everything has changed now as exemplified by the offer of DHS microdata to researchers when the results came out recently.

For generating data of public interest, Demographic and Health Survey (DHS) usually supported by USAID and Multiple Indicators Cluster Survey (MICS) supported by UNICEF are the two most popular surveys in the developing countries. We Myanmars have seen a series of MICS being carried out in our country but the DHS Myanmar 2015-16 is the first DHS for us and its report is available here.

The good news is that you could register and request the microdata for this DHS apart from macro data and information available freely through the reports and the hosted database.

Quickly going over the DHS report, I noticed what seemed to be a somewhat strange shape of the population pyramid. I visually compared it with the population pyramid of the Myanmar Population Census 2014 report.


It seems like the age range 15-29 in DHS is pinched appreciably. So I tried putting the two pyramids on the same graph as shown below.


Why this difference? We know that a large number from the working age population has gone out to work in neighboring countries and elsewhere, and therefore it seems like the reduced proportion of 15-34 age range in males and 15-29 in females may be explained by this fact. However, we note that the populations graphed by both the DHS and Census were enumerated on the same defacto basis. So it seems like the population 15-34 has been genuinely under represented in the DHS. However, I am not smart enough to understand if that would have significant impact on the the survey results or not. Surely real statisticians could make it understandable to the ordinary folks.

Anyway, my real intention is to share a few dumb techniques I picked up to extract data from postscript or pdf files to csv files and make them into graphs after reading them into a spreadsheet. The particular type of graph we are going to make is a population pyramid which many of the users of Excel or other spreadsheets may not be familiar and find it hard to do.

My target audience is the fellow dummies in Myanmar. More widely the general public. Here in Myanmar most still use pirated software, I guess. But, pardon me, I have been using freeware or opensource software for quite a long time. And I strongly encourage others to do so. Why should we be enslaved? Still?

First we'll extract data from a document in pdf format. Normally we open such document with Adobe Reader which comes free. If you want to extract a picture of a data table in that document to put it in your report, you could easily use the “Take a snapshot” tool. But there is no way you could cut a table and then paste it somewhere using the Adobe Reader. To do so I used the Tabula software available for free here. To manipulate and graph the extracted csv data you need a spreadsheet and I recommend Open Office Calc spreadsheet which is a component of the Open Office software available here. Open Office is a dependable alternative to the Microsoft Office software and it is completely free.

The materials used will come from the DHS report available here and the Myanmar Population Census report available here.

I created the graph of population pyramid from DHS report superimposed on the population pyramid from the Population Census report that is shown above with the following steps:
Step-1: Run the Tabula software by double clicking its shortcut.
I assume that you have installed the Tabula software and know how to create the shortcut on the desktop of your PC. Now Tabula will be opened and waits for you to specify the pdf files to import. Note that I've specified the names of the two above files and clicked the Import button in the following screenshot.


Now you click the green button to extract data. Here you scroll down to page-52 in the DHS report where the Table 2.7 for population pyramid data is and select the data to be extracted (area shown in pink).


Then click the green button and you get the data preview.


Is the data OK? No! The first column puts “Age” and “Male” together. Well, read the hints given in the text on the left-most side of the Tabula screen and try to adjust the selection so that you get the preview right.


Then saved the data as csv by clicking the Export button.

Step-2: Manipulate data in spreadsheet.
Now I opened the resulting csv file with Open Office Calc:


Similarly I exported the Table A-6a giving the population by five-year age groups and sex from the Census report to csv format and opened with Open Office Calc. Collecting the two data sets in a spreadsheet and after the extraction of columns and some arithmetic, I get the following table that has all the information needed to create my population pyramid:


Step-3: Graph it.
The above table shows percentages of males and females of different age groups in their own groups. Notice that males are shown as negative percentages. This is because we need to use a bar chart to draw a population pyramid with males on the left of the X-axis.

Another trick is to align the bars for males(left) and females(right) in the same age group perfectly level. You do this by overlapping bars by 100%.


The final trick is to show percentage labels -15, -10, -5 without the minus signs by changing the number format of the Y-axis.


After adding the title and adjusting colors, lines, font, legend and so on, you should get the final graph: