Thursday, September 24, 2015

R vs. XXX


Comparisons of R with well-known commercial statistical software like SPSS, SAS, or Stata come up from time to time. Among the answers on Stack Overflow from five years ago, I like this one by Greg Snow, and it still looks relevant:

“When talking about user friendliness of computer software I like the analogy of cars vs. busses:

Busses are very easy to use, you just need to know which bus to get on, where to get on, and where to get off (and you need to pay your fare). Cars on the other hand require much more work, you need to have some type of map or directions (even if the map is in your head), you need to put gas in every now and then, you need to know the rules of the road (have some type of driver's licence). The big advantage of the car is that it can take you a bunch of places that the bus does not go and it is quicker for some trips that would require transferring between busses.

Using this analogy programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed.
R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back.

R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.”

And he added:

“There are GUIs for R that make it a bit easier to use, but also limit the functionality that can be used that easily. SPSS does have scripting which takes it beyond being a mere bus, but the general philosophy of SPSS steers people towards the GUI rather than the scripts.”

You can read the rest of the discussion on your own, but my favorite answer is one from a student whom I quoted in an earlier post in connection with using R for econometrics. I am repeating here the answer that appeared on Quora in 2014.

Karem Tuzcuoglo, a PhD candidate in economics at Columbia explains:
"One-Click" Programs ((almost) no coding required, results obtained by one click)
  • STATA: Most of the econ undergrad programs use STATA. It is the best program (even at the PhD level) if you want to estimate panel data (i.e., where the data have both cross sectional and time series dimension. Typical examples are surveys and international trade data sets).
  • Eviews: Less famous than Stata, but provides much better time series analysis. If you don't want to do time series forget about Eviews.
  • SPSS: I don't have much information about it. But I can tell that it's not widely used.
"Semi-Coding" Programs
  • SAS: It used to be a big deal 10-20 years ago. Right now not as famous as before - though there are some companies that still strictly prefer using SAS.
  • R: Maybe the most popular program nowadays. First of all it's free! R network and R packages (pre-written algorithms by others) are getting larger and larger. Actually, R can be listed in the next section as well because one can definitely code everything in R. However, the fact that there are so many ready-to-use packages in R makes it also Semi-Coding program if one wants to.
"Pure-Coding" Programs
  • MATLAB: The most famous program among (high level) econometricians. Many applied economics have been done by Matlab. A lot of researchers put their Matlab codes online. It has a good Econometrics package - one still needs to code though.
  • PYTHON: It's more powerful and faster than Matlab. However, it's a very new language; it's still developing.
  • C++: If one wants to do hardcore coding, then C++ is the ultimate program. It's extremely fast in terms of computation (once, my simulation took 25 hours in Matlab, whereas C++ ran the code in 3-4 hours).
  • FORTRAN: Professors above 55+ age will know this program. It's (almost) not used anymore - though we should show some respect to the Father of Coding Programs!
BONUS: There are several other programming languages of course. If you are in UK (especially in Oxford), you will end up using a program called Ox, which is an optimized program for matrix algebra and, thus, for econometrics. Gretl is an extremely easy to use - but less to offer - program.

Among all of the options, I would suggest you to learn R regardless of whether you want to work in academia or in industry (more and more companies begin using R by the way). But if you want to stay away from coding, then go for STATA.

A heated debate on R vs. SAS started in August 2012 on Cross Validated with the question “R vs SAS, why is SAS preferred by private companies?” The last entry was in April 2014. Among the posters, Frank Harrell is the maintainer of the Hmisc package in R. “The package contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX code, and recoding variables.”

Below I've biased myself by selecting only Frank Harrell's comments from the debate and taking them out of context. However, I just wanted them to serve as teasers for the whole discussion.

  • I'm not sold by @PeterFlom 's point either. There are about 4000 packages in R. Not all have to be of the highest quality for add-on packages to have a net positive value. The number of reliable add-on packages exceeds the capabilities of SAS by a huge margin. (Aug 6 '12 at 20:06)
  • True, but it's hard to penalize a statistical computing system for its comprehensiveness. Or to say it another way, R's way of doing something is better than another system's way of not doing it. (Aug 7 '12 at 12:29)
  • I think these comments are not correct. In the server world, open source rules, and the Apache web server is the most popular web server. (Aug 11 '12 at 13:47)
  • I'm hoping that the 2nd edition of RMS will be available in just over a year. (Aug 12 '12 at 13:48)
  • I'm not familiar with that world but I suspect that scientists have more freedom than they think. (Aug 12 '12 at 13:49)
  • There is nothing that needs to be done to redo regulatory approval for the sake of switching to R. (Aug 11 '12 at 18:58)
  • What is the alternative to downloading a package that provides new capabilities (as most R packages do)? Is it to home grow those capabilities? Is that more reliable? (Aug 11 '12 at 18:59)
  • SAS comes with the same warranty as R: none. (Jan 8 '13 at 13:26)
  • Yes, people have to some extent discover R on their own. But much of the issue comes down to inertia of learning a new language. New languages are always coming out that have advantages over older languages yet users cling to the old languages (witness COBOL). Programming in SAS is hugely inefficient, requiring perhaps double the number of programmers to do the same job as R, but SAS experts are happy to hum along on their merry way and companies are afraid of the kind of disruption that would save them millions of dollars in salaries. (Jan 8 '13 at 13:33)
  • I don't follow your reasoning. The amount of money wasted paying programmers to program in an archaic language (SAS) vs. modern free languages is stunning. (Apr 15 '13 at 15:25)
  • Having used SAS for 23 years and S-Plus/R for 22 years I can say that a highly experienced SAS programmer can be highly productive, but that an experienced R programmer can be easily three times as productive. (Jun 10 '13 at 3:18)
Among the comments, the following anti-open-source remark from a SAS representative was the most remarkable and caused an uproar. You should not miss reading the links provided.

... I think the worst anti open source quote I've heard was from SAS saying something like 'would you trust a jumbo jet designed in open source, an engine might drop off' (PaulHurleyuk, Aug 7 '12 at 12:37)

@PaulHurleyuk: +1 The quote was “We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.” by a SAS marketing director in this New York Times article on R. The SAS representative clarified her remarks in a later blog post. (jthetzel, Aug 7 '12 at 12:53)

Note: The first link in the quote above doesn't work. I looked for the article and found it here.

As usual, I would say the best way will be to read all these and more and then make up your mind on your own. As poor small guys we hardly need a second look.




Tuesday, September 22, 2015

Thakhin spirit and expansion to thousand lights


Only yesterday a friend told me that some young people from his nonpartisan research unit recently had the opportunity to learn SPSS. SPSS, as you know, is the Statistical Package for the Social Sciences and is close to a household word in Myanmar among people connected in some way with social sciences, surveys, or statistics. Most of the time when I mentioned R, a complete environment for statistical computing, or CSPro, a survey data processing software, they would just ask: how does it compare with SPSS?

The point of the story is that this friend told me that the training was offered by some professionals from an industrial powerhouse nation in Asia, entirely free, and, I suppose, with no strings attached. However, I was a bit worried. We have seen that when we began opening up not too long ago, a lot of swindlers disguised as businessmen came in by swarms (to make it a little more dramatic). I don't know how much they've squeezed out of our people, but later, searching on the Web with the clues I had heard of, I could identify that they were using pyramid and Ponzi schemes, maybe more. I suspect they are still lurking somewhere.

Anyway I was carried away talking about our people being cheated. In the present context relating to our friend I am positive they are receiving a genuine transfer of know-how. Nevertheless, my concern with it is sustainability and the prospect for an expansion of the knowledge base. We should be aware that while learning to use SPSS or any other commercial software could be made free, the software itself is not free and successive sharing of knowledge would call for proportional expansion of financial resources. While such software could be free in the sense that some funding agency undertakes to license it for you for the time being, clearly the agency could not do it indefinitely, or respond to ever growing needs.

So, you have two options: (A) when the honeymoon is over, or when you want to significantly expand the number of computers on which SPSS was initially installed, you could resort to using pirated copies of the software, or (B) use some open-source or free software from the beginning. Looking at the price list of SPSS just now, I found a lot of complicated licensing arrangements. As far as I could understand, there are four package configurations with starting prices (in US dollars per user): Base ($1140); Standard ($2530); Professional ($5090); and Premium ($7590). With each of them it seems you get software support for only 12 months, but you could use the software indefinitely. Here, it should be noted that for complex samples, which covers virtually all sample surveys, you need the complex samples module for data analysis. Among the packages mentioned, only the Premium package includes it. For the others, an indefinite license per user will cost an extra US$1450.

Let's do a little bit of calculation. Let's assume that you need at least 10 of your computers installed with SPSS to be realistically operational. Then we have (i) Base + Complex Samples, total cost US$25,900 and (ii) Premium, total cost US$75,900. Well, these don't look like much for an international donor, right? But think again. If you are mandated to share, or you yourselves intend to share, knowledge with people outside of your organization, that won't work.
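The arithmetic behind those two totals can be sketched in a few lines of Python. This is only an illustration: the prices are the per-user figures quoted above, while the function and variable names are made up for the example, and it assumes (as the text suggests) that every package except Premium needs the Complex Samples module added.

```python
# List prices per user in US dollars, as quoted in the post.
PACKAGE_PRICES = {"Base": 1140, "Standard": 2530, "Professional": 5090, "Premium": 7590}
COMPLEX_SAMPLES_ADDON = 1450          # per-user price of the Complex Samples module
PACKAGES_INCLUDING_ADDON = {"Premium"}  # assumption: only Premium bundles the module


def total_cost(package, seats):
    """Total licence cost for `seats` users, adding Complex Samples where it is not bundled."""
    per_seat = PACKAGE_PRICES[package]
    if package not in PACKAGES_INCLUDING_ADDON:
        per_seat += COMPLEX_SAMPLES_ADDON
    return per_seat * seats


# Cost of equipping 10 computers with each configuration.
for name in PACKAGE_PRICES:
    print(f"{name}: US${total_cost(name, 10):,}")
```

Changing the number of seats shows how quickly the bill grows once the software has to be shared beyond one small office.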

This reminds me of a political joke of many years ago about a zawgyi, a mythical magician who at the height of his powers could fly in the air or bore through the earth. Apparently, this particular rookie zawgyi couldn't quite finish building up his magic. So he ended up neither flying nor walking but hovering at a man's height in midair.


For R, the base system and any or all packages are free. For example, the “survey” package is a dedicated package for the analysis of complex samples and you could download it any time you want to.

Obviously, my choice is option (B) and I have been advocating the R statistical environment in my earlier posts:

  • Spigot algorithm for calculation of pi (in Teashop PI-I)
  • Pigeon half, five-for-duck, quarter a-sparrow
  • An Unclaimed CD on Psychometrics with R or Intro to Anything with R
  • Big data: small guys could do it?
  • Big data: hands-on correlation, old and new
  • Correlates of labor productivity growth
  • Blind leading the 20/20
  • Econometrics for the Masses, Blind Boy, and Courage
  • Fooling around and having fun with PVT

I call option (B) mentioned earlier “Thakhin spirited” and the open-source model an “expansion to thousand lights” model, as lighting a thousand candles from a single source would only increase the sum total of available light and won't take away the light from the donor candle.

On the other hand, if you like option (A), take it, and then you may like to name it yourselves.  

Sunday, September 20, 2015

Myanmar Land tidbits: Did we miss the boat?


We used to have our headquarters on the middle floor of the Gandhi Hall building on the corner of Merchant Street and Bo Aung Gyaw Street in downtown Yangon. Apart from my other duties, I had to take care of the library. It was not much of a library, though, and consisted of only four or five large bookcases lined across the hallway. In those days I was quite familiar with the contents of these bookcases but would be at a loss to describe them now, save for one.

It was a leather-bound book three fingers thick, and the title on the book said “Torrens System”. It was in the style of leather binding we find with the religious books that the book binders on the steps of our great Shwedagon pagoda used to make. I have no doubt therefore that it must have been a prized collection some time before, probably when the British Settlement Officers or the Commissioner were still in office. I went through the book and thought I was able to grasp and appreciate the idea behind the Torrens System. Thinking about its contents now, I was impressed most of all with the spirit of Torrens to make land transactions easy in contrast to the deeds registration system, its principle of indefeasibility of the title, and the attending cadastral system that could accurately reconstruct the boundaries of a given land holding on the ground in case of disputes.

Talking of the Torrens system, I was really surprised when a senior agricultural economist turned political economist, a Myanmar living Down Under, told me that we in fact have the Torrens system in Myanmar. In my experience of working in the government agency that specifically deals with land administration, including assessment of agricultural land tax, maintaining cadastral maps and registers, collecting agricultural statistics, and handling land disputes, I had never heard of or read about our cadastral system being seen as a Torrens system. I had worked there for 26 years, half of that in the districts and the other half at our headquarters in Yangon.

This friend told me that I could find the reference to the Torrens system in Maung Htin's well-known work “Myanma-le-yar-myay-sanit” (Agricultural land system of Myanmar) and, if I heard him right, he said this system was used particularly in the “Colony lands”. I was doubly surprised because I am quite familiar with this work and I was sure I hadn't noticed anything about the Torrens system in there. Afterward I looked for Maung Htin's book, read through it carefully, and yet couldn't find anything on Torrens!

Later, looking for the possible source of reference for Torrens system in Myanmar I found the following in Housing, Land, and Property Rights in Burma, 2004, by Nancy Hudson-Rodd:

The Land Records and Settlement Department in Burma adopted a modified Torrens System of land registration, for all areas settled by the colonial state. British. Burma was conquered in two stages, 1826 Lower Burma and 1886 in Upper Burma, becoming a colony of the British Empire. To suit these different jurisdictions, the Land and Revenue Act 1874 and the Upper Burma Land Revenue Act 1889 were two acts that effected the imposition of a tax to cover the cost of administration and governance by the British colonial government on settled and alienated land in both Lower and Upper Burma. Legal control and classification of land in Burma was initiated by the British in 1876 as part of their introduction of a revenue collection and taxation system. Cadastral surveys were conducted to classify all land according to ownership and use.” (p. 18)

Resources on the Torrens system on the Web, as of now, show that Thailand, Malaysia, Singapore, and the Philippines are using the system. A survey on the earlier adoption of the system by J E Hogg entitled Registration of the Title to Land Throughout the Empire, 1920, cited 17 statutes including that of the “Federated Malay States”. However, there was nothing on “Burma” as I hastily looked through it.

Going back to Nancy Hudson-Rodd's statement, historical evidence from Myanmar shows that cadastral surveys initiated earlier on a holding basis were superseded in 1878 “by field to field surveys on professional lines followed up by regular settlements.” According to the Wikipedia entry on “Torrens Title”, the system originated in 1858 in South Australia:

A boom in land speculation and a haphazard grant system resulted in the loss of over 75% of the 40,000 land grants issued in the colony (now state) of South Australia in the early 1800s. To resolve the deficiencies of the common law and deeds registration system, Robert Torrens, a member of the colony's House of Assembly, proposed a new title system in 1858, and it was quickly adopted. The Torrens title system was based on a central registry of all the land in the jurisdiction of South Australia, embodied in the Real Property Act 1886 (SA).”

Recall that by the time cadastral surveys on professional lines were adopted in Myanmar in 1878, the Torrens system had already been in place in South Australia for 20 years, so it seems unthinkable that the colonial professionals taking care of cadastral surveys in Myanmar would have been entirely ignorant of the Torrens system. However, it is truly odd that, as far as I can ascertain, no historical documents on land revenue administration in Myanmar ever mentioned the Torrens system. Besides, the cadastral system in Myanmar has not changed significantly from those days till now. From my personal experience, I never knew any of my seniors or juniors to discuss anything about the Torrens system, and I may safely boast that I could have been the only one around that time who had looked through the Torrens book I talked about.

Perhaps Hudson-Rodd was passing her judgment on the characteristics of the rural land registration system in Myanmar as “Torrens like” and did not mean to say anything about its origins. Perhaps my elder economist friend, a collaborator of Hudson-Rodd, has misread Maung Htin. Or was it a quirk of memory?


To me, the real issue is that whether we call the current system “Torrens like”, “Embryonic Torrens”, or by any other name, we should be doing a reality check. Should we not critically examine the successes of the Torrens system as practiced in Thailand, Malaysia, Singapore, and the Philippines to see if we have missed the boat, and act accordingly?

Friday, September 18, 2015

Myanmar Land tidbits


When I came across the claim that “The British did not give full proprietorship title to land therefore they called the dues collected on land as Land Revenue instead of Land Tax” I wasn't satisfied. I thought it must have been just a play on words. To me revenue sounded like what the government got out of the taxation process, while tax is the burden that falls on taxpayers. Nevertheless, it was the conventional wisdom among fellow officers, based on that assertion, that land ownership recognized by the British Government had been some form of inferior ownership. It has been some twenty-five years since I left that government agency, yet some of the papers written by my younger friends in recent times still carry that assertion, without scrutiny, as truth.

I thought I found this assertion in a booklet or a report by some high official of the Land Nationalization Department while I was a government employee. I am not sure, though. Looking back, I wonder if it carries the overtones of the ruling BSPP (Burmese Socialist Programme Party). Too bad I didn't discuss the merit of this assertion with my seniors. Anyway, most of us would have been wise enough those days not to be inquisitive.

By luck I happened to call up one of my retired younger co-workers a few days ago and he was able to email me a scanned page of the source for that assertion. The following excerpt was from the booklet explaining the prevailing settlement procedures for fixing rates of land revenue with the sponsorship of the Revolutionary Government. The booklet was distributed by the Settlements and Land Records Department in 1966.


It reads:
2. In the time of the British government, they did not give full possession of land (proprietorship) to the people. It was rather the right to hold land (Land Holder's Right). If it were proprietorship, the dues collectable on land have to be “Tax on Land”. If the people were treated only as tenants, the dues have to be called “Rent”. As the right on land given to people by the British government was not as good as proprietary, but still better than the mere rights of tenants, neither the term “Tax on Land” nor “Rent” was used and the compromise “Land Revenue” was coined. That was how land revenue came to exist.”

The concluding words of the excerpt seem to say that the term Land Revenue originated in Myanmar, thanks to the ingenuity of the British administrators. However, the British had used this term in India before us for the purpose of land taxation. This is from the "Report Of The Land Revenue Commission Bengal Vol I":

14. ... All Governments in India have considered themselves entitled to a share of the produce, and this share of the produce, whether collected direct, or through farmers of revenue, or through subordinates or intermediate landlords, is called "land revenue".

On the other hand, the notion of Land Revenue as some halfway concept is easily disproved by this excerpt [The Land and Revenue Act (India Act II, 1876), in The Lower Burma Land Revenue Manual, 1945]:


Here we can see that for taungya cultivation (slash-and-burn cultivation) a “tax” will be collected instead of land revenue. In this connection, we could easily see that the land used for taungya was hardly held with full proprietorship, yet the due was called a “tax”.

Whether Land Holder's Right is a proprietorship title would have been a deep and controversial topic. I guess it would have been hotly debated by Myanmar intellectuals and activists, at least in the latter part of colonial rule and particularly after Myanmar's independence. Students of Myanmar land systems and historians would have something concrete to say on this topic.

Here, I am aware of the accounts in which Buddhist monks contested the King's confiscation of their religious lands in the Bagan period and won at the courts or tribunals. These episodes could be found in historian Dr Than Tun's Studies in Burmese History Number One, 1969 (pp. 164-166). They seem to signify that the King is not “The Lord paramount over, and the chief proprietor of, the soil” in Myanmar, at least in relation to religious lands.

A stronger judgment comes from a younger generation of Myanmar historians. This is from the Google Books entry on Thant Myint-U's “The making of modern Burma”:


I assume “A structure of genuinely private ownership, entirely free of gentry or aristocratic control or involvement” effectively means “allodial title”, maybe with some restrictions.

In contrast, on page 148 of BSPP's publication “History of Myanmar Land, vol. 1” of 1970, it was stated that “right to ownership in land in reality was a mere right to hold land”.


On page 85, however, it was stated that “Nevertheless, under the 1876 Land and Revenue Act, squatters who had worked their lands for 12 years without interruption are entitled to possess the land. So long as they pay land revenue regularly, government cannot evict them. Moreover, such an owner has the right to treat the land as his or her privately owned land and use it as he or she wishes.”


To me, these two statements look contradictory.

Sunday, August 30, 2015

Yan Can Cook or More fun with PVT


I've been a great fan of Martin Yan since 1985 or 86, when I was lucky enough to get a chance to visit the U.S. and watch his TV program "Yan Can Cook". Then there was a lapse for a decade or so when I returned home. After that I was again able to watch him in the Marshall Islands in the Pacific for a year or two. Now, I am not sure if I have watched his old or new programs on this side of the new millennium. Anyway, not so long ago I was suddenly nudged by curiosity to know if the familiar Yan accent is for real or not. I looked that up on the Web. Well, for now I'll leave it to you to find out what I found, or to guess it. That's up to you.
                                              
In one of Yan's programs, I was really amazed watching him separate meat from bone and cut up a whole chicken in a snap, just with his big chopper. In another one he showed how to slice onions really fast, with this big chopper again. Anyone would be scared stiff at the idea of slicing onions with a big, heavy, and razor-sharp chopper, but in reality the big broad blade itself is the key to superfast slicing while keeping your fingers safe!

Then, after watching so much of Yan, did I learn to cook or slice vegetables with a chopper like him? No, simply because there is someone with me all the time to handle genuine Myanmar day-to-day cuisine really well, or not so. Anyway, if I try to emulate Yan would I do well? Honestly, I don't think so. Yet, I did pick up Yan's philosophy for good: If Yan can cook, so can you.

With these words of Yan's encouragement I recently tried to start learning about creating data collection applications for mobile phones. Among the different software options available, I picked CSEntry because I know a bit of CSPro, the mother software of which CSEntry is the data entry module. With CSPro you could develop data entry applications for the Windows platform or for Android.

The idea is to develop and test the CAPI (computer assisted personal interviewing) application with CSPro software that is running on a Windows computer. For Android phone data collection you need to develop the CSEntry application with the CSPro software version 6.1. Then you would do most of the testing on the Windows machine and finalize the application going back and forth between your desktop and the phone.

After that you compile the data entry application on the Windows machine to get the pen file (say xxx.pen). When you test-run the pen file on the Windows machine, you will get a pff file (say xxx.pff). These two files are all you need to run a data collection application on your Android phone or tablet. Of course you need to have the CSEntry program for Android installed on the phone or tablet in the first place.

  • The required CSPro 6.1 software and manuals could be downloaded from the U.S. Bureau of Census website here.
  • CSEntry for Android could be downloaded to your phone from the Google Play Store.
  • Visit the CSPro Users website for goodies on CSPro and CSEntry for Android.

This is how I worked. To make head or tail of a CAPI application, I played with the "simpleCAPI" application that comes with CSEntry for Android. After graduating from it, I worked through the data entry application in the "Examples\CAPI" folder installed with the CSPro 6.1 program on my PC (I was lucky to have some experience working on regular data entry applications on the PC). Then I tried developing a PVT CAPI data entry application for Android on my own. Here, as I have already been posting about parallel vote tabulation on my Bayanathi blog, I felt that a PVT data collection application would not be too hard to do.

As the idea for the exercise is to get a working model for PVT mobile data collection and not much more, I based my application content almost entirely on the PVT sample observer forms given on pages 89-90 of the handbook for quick count/PVT by NDI (The Quick Count and Election Observation: An NDI Handbook for Civic Organizations and Political Parties, Estock, Nevitte, and Cowan, 2002).

Here are some screen shots of my PVT_1 application on Android phone.


Working on an Android CSEntry application gives you some refreshing experience you don't get with the desktop application. The checkbox for inputting multiple answers to a single question, as in the screen shots above, is a beautiful example. Below you can see how it works the same way as a paper questionnaire, but is far more convenient because it can instantly show you previously entered data that you want to look up.


In the first screen on the left of the screen shots above, the question was "Which parties contested the vote counting results?" For an earlier question, the list of political parties present at vote counting had already been entered. The program performed a check on the answers to these two questions and returned the message shown in the middle screen shot. Now tapping the CSEntry logo in the top-left corner of the screen brings up the list of all questions and answers entered (known as the Case Tree) and there you can find the previous answer. Then you can correct either or both of the answers as necessary.
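The kind of cross-question check described here can be sketched outside CSPro. The snippet below is plain Python, not CSPro logic, and it is not the code of the PVT_1 application. It assumes the rule being enforced is that a party can only contest the count if it was recorded as present at the counting; the function name and the party names are hypothetical.

```python
def parties_not_present(present, contesting):
    """Return the parties ticked as contesting the count but not ticked as present.

    Both arguments are collections of party names taken from two
    multiple-answer (checkbox) questions. An empty result means the
    two answers are consistent under the assumed rule.
    """
    return sorted(set(contesting) - set(present))


# Hypothetical answers entered by an observer.
present_at_count = ["Party A", "Party B"]
contested_result = ["Party B", "Party C"]

problems = parties_not_present(present_at_count, contested_result)
if problems:
    # A data entry program would show a message like this and let the
    # observer go back to correct either answer.
    print("Contested the count but not listed as present:", ", ".join(problems))
```

A check of this sort does not say which of the two answers is wrong; it only flags the inconsistency so the observer can review both.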

If you want to try out my application follow these steps:
  1. Install CSEntry on your Android phone/tablet.
  2. By doing so you will also get the application "Simple CAPI" installed in the folder "csentry" on the SD card.
  3. Download pen file from this link: PVT_capi.pen.
  4. Download pff file from this link: PVT_capi.pff.
  5. If you've opened my blog post with your Android phone/tablet, both files will normally be stored in the "Download" folder. Cut and paste them into your "csentry" folder on your Android phone/tablet. Now, if you run CSEntry on your phone/tablet you will see my application "PVT_1". Tap on it and you are on your way to Start New Case and enter data.

I have created this application for fun (and maybe some use).

I don't know if it works perfectly or not. I simply don't have the expertise to guarantee anything. Learn CSPro and try to do things on your own, or pick the software of your choice from among other free/open source software available for mobile data-collection application development.

Now it's my turn to say: If Bayanathi can do it, so can you.

Thursday, July 16, 2015

Fooling around and having fun with PVT


Even some election observation experts don't believe in taking samples of voting stations for quick count or PVT. They think it is better to take all the voting stations and do away with the risk of just taking a sample. This idea seems like common sense, but it is flawed.


Although international and domestic groups have conducted sample-based PVTs in dozens of countries since 1988, PVTs have sometimes drawn controversy in some quarters of the international community. National election authorities, foreign aid officials, and technical advisers have sometimes questioned the feasibility and accuracy of a vote count verification exercise based on statistical sampling, even though the use of statistical sampling in polling and research is widely accepted among social scientists, media organizations, public opinion researchers and politicians around the world. They also worry that a separate, unofficial vote projection that diverges from the official count might foment postelection unrest.


Misgivings among election authorities and national political elites about the purposes and methodology of PVTs are not surprising. Election authorities rarely like the idea of independent organizations, domestic or foreign, threatening to second guess the official results or offering their own reports of the election outcome. Foreign involvement in such exercises can also be seen as a threat to local sovereignty or hurt national pride because it seems to imply that national authorities require international oversight.


The reason that collecting data from all the units (a census) might not give results as reliable as collecting data from some of the units (a sample) is the vastly larger scale of operation for the former. This is a well-known fact in the census/survey community. Even the seemingly simple and routine tasks of collecting vote count results from the voting stations, transmitting them to headquarters, and tabulating the results are no exception to this rule.


The critically important transitional elections in Indonesia in June 1999 produced considerable controversy among both domestic and international actors.


In response to substantial public mistrust of the official election authorities a coalition of Indonesian universities called the Rectors' Forum, with advice from NDI, proposed a sample-based PVT.
... Apparently, for the first time, however, development agency officials and technical advisers questioned the intellectual basis of a sample-based PVT. In particular, some PVT critics questioned the PVT's reliance on statistics. They claimed, incorrectly, that random statistical sampling would not work in the absence of extensive baseline demographic data or could not be used for proportional representation elections. This was a fundamental misunderstanding of the principles of statistics.
Yet because of these unfounded concerns about a sample-based PVT, many Indonesian election and government officials, a number of foreign technical advisers, and some development agency officials initially opposed the PVT. Some urged instead that an independent vote tabulation should consist of a comprehensive PVT, which would attempt to collect all the results from several hundred thousand polling stations in the country, much as NAMFREL had attempted to do in the Philippines in 1986.


Subsequently, key international actors organized an unofficial comprehensive count in Indonesia, called the Joint Operations Media Center (JOMC). It was organized on the behalf of the Indonesian election commission with funding and technical assistance from American, Australian, and Japanese organizations and the United Nations Development Program (UNDP). Before the election, one of the international organizers promised a “facility . . . capable of reporting reliable results of the elections at the earliest practical moment.”


The JOMC's spokesperson told the media he hoped that 50 percent of the results would be known by the day after polling.


... The JOMC was ultimately unable to collect meaningful results. By the morning after election day, it was reporting less than 1/4 of 1 percent of the vote, a meaningless number. Even by three days after the elections, the JOMC could report only 7.8 percent of the vote count, still too small to support any conclusions about the outcome of the elections. ... Rather than reassuring Indonesians and the international community about the integrity of the vote count, the JOMC parallel count actually undermined confidence by raising expectations that it could not meet. Both the sample-based PVT and the comprehensive JOMC ultimately failed to build confidence in the integrity of the reported election results.


Leaving aside the complex issues of PVT vs. exit polls, sample PVT vs. comprehensive PVT, or vote count verification in general, you may like to relax for a moment and have some fun playing around with the sample size for a PVT using real-life voting data. You could do that with what is known as computer simulation. You could learn about the rationale and philosophy and all the nice and impressive things about simulation later, if you like (pardon me, I didn't).


To start with, you will need to have a bit of knowledge about using computers. I would assume that you have installed R on your computer and know how to run a script file with it. If you haven't installed the simFrame package, then install it.


As for the data, download the precinct-level 2012 US elections data for Texas from the Harvard Elections Data Archive on Harvard Dataverse. You could download the data file in tab-delimited text format, R data format, or the original Stata file format. Unfortunately the R data file doesn't work. The Stata data file is fine. I don't know for sure whether precinct-level elections data means voting-station-level elections data. I assumed it does, but it does no harm for the purpose of our exercise if the two are not exactly the same.


The handbook for quick count/PVT by NDI mentioned in my previous post gives a detailed description of how to determine the sample size. The report by the Committee for Free and Fair Elections in Cambodia (COMFREL), Parallel Vote Tabulation Through Quick Count for 2008 National Assembly Elections, October 2008, showed that it followed the NDI approach. Among other resources, the ACE encyclopedia (version 1.1) noted "On the whole and probably in a rather random way, one might say that there is an inclination towards doing quick counts on 10% of the population in the case of transition elections (e.g., Chile in 1988, Panama in 1989 and Bulgaria in 1990)." The Handbook for Domestic Election Observers by OSCE/ODIHR, 2003, observed similarly: "Experience shows that where there is little demographic data and the population is quite diverse, the tendency is to use a relatively large sample, such as 10 per cent of polling stations. Where the opposite is true, a smaller sample can be used and provide sufficiently credible and accurate results for national elections."


In its methodology note on PVT, Pakistan General Elections 2008: Election Results Analysis by Free and Fair Election Network explains:


Experience with past PVTs has shown that drawing a sample of 25-30 polling stations provides sufficient data, within a relatively small margin of sampling error, to assess the reasonableness of official election results. Adding additional polling stations to the sample, even when the number of total polling stations is large, does not improve the margins of sampling error dramatically.


The reason for this statistical principle is that a PVT works with “cluster samples” – each polling station “cluster” averages 1,000 registered voters, and 25 polling stations in a constituency produces a sample of 25,000 voters (25 polling stations x 1,000 voters each) which is much more than statistically sufficient to permit comparisons with official results.


... As part of the world’s largest PVT, almost 16,000 Polling Station Observers (PSOs) from the Free and Fair Election Network (FAFEN) witnessed and recorded the actual vote count in a statistically valid sample of 7,778 randomly-selected polling stations during the 2008 Pakistan National and Provincial Assembly Elections. The national sample of 7,778 polling stations represented almost eight million registered voters.
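The diminishing returns described in the FAFEN note can be checked with a little arithmetic. The sketch below is in Python rather than R, purely as an illustration; it is not taken from any PVT handbook. It assumes a simple random sample of stations, treats every station as equal in size, and uses made-up values for the number of stations in a constituency (300) and for the station-to-station spread of a candidate's vote share (10 percentage points).

```python
import math

def margin_of_error(sd_points, n, n_stations, z=1.96):
    """Approximate 95% margin (in percentage points) for the mean
    station-level vote share under simple random sampling without
    replacement, including the finite population correction."""
    fpc = 1.0 - n / n_stations
    return z * sd_points * math.sqrt(fpc / n)

N_STATIONS = 300   # hypothetical number of stations in one constituency
SD_POINTS = 10.0   # hypothetical spread of the vote share across stations

for n in (25, 50, 100, 200):
    margin = margin_of_error(SD_POINTS, n, N_STATIONS)
    print(f"n = {n:3d} stations: +/- {margin:.1f} points")
```

The margin shrinks roughly with the square root of the sample size, so quadrupling the number of stations only about halves it (a little more than that when the sample is a large share of all the stations).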


Ordinary people and even some experts find it hard to believe that taking 25 or 30 voting stations out of a large number of them in a constituency would give a good enough estimate of the true voting results. For our exercise we have downloaded the Texas data for the 2012 elections. It includes data for 8952 precincts, of which 278 have 0 votes. It covers election results for U.S. President and for the U.S. and State House of Representatives and Senate. For this exercise you will take the votes for the President.


Here's how you could play around with the sample size for PVT. You take a simple random sample of 25 precincts out of the 8674 with any votes. Then you total up the votes for "g2012_USP_dv" (Democratic votes), "g2012_USP_rv" (Republican votes), and "g2012_USP_tv" (Total votes) for this sample. Then you estimate their totals for Texas.

Theoretically you want to do this for an infinite number of samples. Obviously you can't. As someone said, running 10,000 samples won't hang your computer and it is as close to infinity as you can comfortably get. So you would run the simulation with 10,000 samples. Finally you would estimate the total votes for Texas by taking the mean of the estimates from all 10,000 samples. Then you could compare them with the known results for Texas to see how accurate they are.
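For readers who want to see the bare mechanics of these steps, here is a minimal sketch in plain Python. It is not the simFrame script and it does not read the Texas file: the precinct data are randomly generated stand-ins, and the names `precincts`, `estimate_totals` and `simulate` are my own hypothetical choices. It also assumes that "estimating the totals" means scaling the sample sums up by N/n, which is the usual expansion estimator for a simple random sample.

```python
import random
import statistics

random.seed(2015)

# Hypothetical stand-in for the Texas file: one tuple per precinct,
# (democratic votes, republican votes, total votes). Not real data.
N_PRECINCTS = 8674
precincts = []
for _ in range(N_PRECINCTS):
    total = random.randint(50, 3000)
    dem_share = min(max(random.gauss(0.41, 0.15), 0.02), 0.95)
    dem = round(total * dem_share)
    rep = round((total - dem) * 0.97)   # the remainder goes to other candidates
    precincts.append((dem, rep, total))

true_totals = [sum(p[i] for p in precincts) for i in range(3)]

def estimate_totals(sample, population_size):
    """Scale the sample sums up by N/n (the expansion estimator)."""
    weight = population_size / len(sample)
    return [weight * sum(p[i] for p in sample) for i in range(3)]

def simulate(n_sample=25, n_reps=10_000):
    """Draw n_reps simple random samples (without replacement) and
    return the estimated totals from each of them."""
    return [estimate_totals(random.sample(precincts, n_sample), N_PRECINCTS)
            for _ in range(n_reps)]

estimates = simulate()
for label, i in (("Democrats", 0), ("Republican", 1), ("Total", 2)):
    values = [e[i] for e in estimates]
    mean_est = statistics.fmean(values)
    spread = statistics.stdev(values)
    print(f"{label:10s} mean of estimates {mean_est:12.0f}  true {true_totals[i]:10d}  "
          f"ratio {100 * mean_est / true_totals[i]:6.2f}%  sd across samples {spread:10.0f}")
```

The spread across samples is printed next to the mean because the average over 10,000 samples shows that the method is on target, while the spread shows how far a single sample of 25 may land from the truth.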


Here's how I did that with the simFrame package:












You should get these results with the above code:
(i) For total votes


   Vote_For SimulatedTotVotes TrueTotVotes AccuracyPercent
1  Democrats           3302674      3307609           99.85
2 Republican           4562952      4568788           99.87
3      Total           7986507      7997303           99.86


(ii) For percentage of total votes


   Vote_For  SimulatedPCVotes  TruePCVotes AccuracyPercent
1  Democrats             41.35        41.36           99.98
2 Republican             57.13        57.13          100.00


In sampling terms, a PVT consists of a sample of clusters (the voting stations). When the clusters differ greatly in "size", the precision of the estimates will suffer. Stratifying the voting stations by "size" and taking samples independently in each of the groups (strata) could improve the precision of the estimates of vote counts.
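To make the idea concrete, here is a small Python sketch (again not simFrame, and not the Texas data) that compares a simple random sample of 30 stations with a stratified sample of 10 stations from each of three size groups. Everything in it is an assumption for illustration: the stations are randomly generated, "size" is taken to be the number of registered voters, and the strata are simply the smallest, middle and largest thirds.

```python
import random
import statistics

random.seed(42)

# Hypothetical stations: (registered voters, votes cast). Not real data.
stations = []
for _ in range(3000):
    registered = random.randint(100, 4000)
    turnout = min(max(random.gauss(0.6, 0.08), 0.2), 0.95)
    stations.append((registered, round(registered * turnout)))

true_total = sum(votes for _, votes in stations)

def srs_estimate(n):
    """Estimated total votes from one simple random sample of n stations."""
    sample = random.sample(stations, n)
    return len(stations) / n * sum(votes for _, votes in sample)

# Three strata of equal count, cut on the number of registered voters.
ordered = sorted(stations)
third = len(ordered) // 3
strata = [ordered[:third], ordered[third:2 * third], ordered[2 * third:]]

def stratified_estimate(n_per_stratum):
    """Estimated total votes from independent samples in each stratum."""
    total = 0.0
    for stratum in strata:
        sample = random.sample(stratum, n_per_stratum)
        total += len(stratum) / n_per_stratum * sum(votes for _, votes in sample)
    return total

srs = [srs_estimate(30) for _ in range(2000)]
strat = [stratified_estimate(10) for _ in range(2000)]
print("true total              ", true_total)
print("SRS        mean, sd     ", round(statistics.fmean(srs)), round(statistics.stdev(srs)))
print("stratified mean, sd     ", round(statistics.fmean(strat)), round(statistics.stdev(strat)))
```

With data built this way the stratified estimates vary much less from sample to sample, because votes cast follow station size closely. How much would be gained with real stations depends on how strongly the quantity being estimated is tied to size.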


I guess one way to look into this in our Texas data would be to draw a scatter-plot with the ratio Republican-votes/Democrat-votes on the y-axis and total-votes on the x-axis. We could then see if this ratio changes with the "size" (number of voters) of the precincts.
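A number can go alongside the picture: the slope of a straight line fitted to the same two variables. The Python sketch below is only an illustration on randomly generated precincts, not the Texas data; `least_squares` is a hypothetical helper of mine, and precincts with no Democratic votes are skipped because the ratio is undefined there.

```python
import random

random.seed(7)

# Hypothetical precincts: (democratic votes, republican votes, total votes).
precincts = []
for _ in range(2000):
    total = random.randint(50, 3000)
    share = min(max(random.gauss(0.41, 0.12), 0.05), 0.9)
    dem = max(1, round(total * share))
    precincts.append((dem, total - dem, total))

def least_squares(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Keep only precincts where the ratio can be computed.
usable = [(d, r, t) for d, r, t in precincts if d > 0]
sizes = [t for _, _, t in usable]
ratios = [r / d for d, r, _ in usable]

intercept, slope = least_squares(sizes, ratios)
print(f"ratio = {intercept:.3f} + {slope:.6f} * total_votes")
```

A slope near zero means the ratio does not drift with precinct size. In this made-up data it is near zero by construction, since the vote share was generated independently of size.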


Here's the scatter-plot:



The same scatter-plot done with the package "hexbin" is here:


Note that they both have a regression line drawn on the graph. From these two graphs, I guess, I could make out that stratification will not be very effective in this situation. I also have a hunch that plain systematic sampling would be good here.
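For anyone who wants to try that hunch, systematic sampling takes only a few lines. The Python sketch below is a generic version with a random start and a fixed (fractional) interval; the function name and the list of precinct numbers are hypothetical, and the order of the list matters, since systematic sampling inherits whatever pattern that order carries.

```python
import random

def systematic_sample(units, n, rng=random):
    """Pick n units at a fixed interval after a random start.

    A fractional interval is used so that exactly n units are returned
    even when len(units) is not a multiple of n."""
    if not 0 < n <= len(units):
        raise ValueError("n must be between 1 and len(units)")
    step = len(units) / n
    start = rng.random() * step          # random start in [0, step)
    last = len(units) - 1
    return [units[min(int(start + i * step), last)] for i in range(n)]

# Hypothetical frame: precincts numbered 1..8674 in whatever order the file lists them.
precinct_ids = list(range(1, 8675))
sample = systematic_sample(precinct_ids, 25, random.Random(1))
print(sample)
```

Passing a seeded `random.Random` object makes the draw repeatable; leaving it out uses Python's global random generator.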


Although this simulation exercise is directed at PVT, it could be useful in helping to convince the skeptics that sampling really works. In a sense, I was hoping to give young people and ordinary folks a peek at simulation, PVT, and sampling. Once they are interested, I'm sure they would like to try out the beautiful hexbin plots too.

Look for resources on simulation, PVT, and hexbin on the Web, learn more, experiment and enjoy (more like advising myself)! Besides, improve on my ideas and code, would you?