Saturday, July 30, 2016

Myanmar-Sar in R - I


After attempting to build a mobile-phone data collection application for parallel vote tabulation (Yan Can Cook or More fun with PVT), I was hugely frustrated by the problem of displaying Myanmar characters on Android phones (Myanglish PVT). So I ranted, “As mentioned in my last post Myanglish PVT, I was frustrated, so much frustrated in failing to get a respectable version of PVT data collection application in Myanmar language.”, and posted “PVT tawla or lost in font jungle”. Luckily, a few months ago my son gave me a Motorola Nexus 6, and I was overjoyed to find that it displayed the Myanmar-Sar in my PVT application perfectly. For Myanmar-Sar input I installed the Bagan Keyboard and the Myanmar Keyboard by Abbott Cullen, and both worked well.

In an application like PVT, Myanmar-Sar is needed only to assist the interviewer: for the question wordings, the instructions to the interviewer, and the response categories to choose from. In the data file these are represented by codes (in English), and the only textual data entered in Myanmar-Sar would be the name of the candidate. Most survey data files are organized this way, that is, most of the entries are precoded categories. However, in certain situations, such as data collection for a civil registry or recording bird and animal sightings, we may need to enter a lot of data as text in Myanmar-Sar.

How could we work with Myanmar-Sar in situations like the latter, specifically in R? Searching for the answer on Stack Overflow, which I guess is the most likely place, I found that it wasn't as easy as I first thought. After working through a number of questions and answers on handling non-English characters such as Chinese or Russian, I stumbled upon a workable approach, and it will be the theme of this post. But first let's look at question 17934847, asked two years ago by dinari, which still needs an answer:


We are told that it relates specifically to the Mac environment, but we can still try to imitate it in Windows. Since no link was provided for the data “data.csv”, I will be using the Myanmar Information Management Unit (MIMU) place codes for the States/Regions of Myanmar, available here. The unzipped file is an Excel file with the .xlsx extension. I opened the file in OpenOffice Calc and saved its “State_Region” sheet as a comma-separated values text file, “State_Region.csv”, in UTF-8 encoding. Opened with Notepad, the Myanmar-Sar came out perfectly:
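As a minimal sketch of reading such a UTF-8 file back into R, the snippet below writes a tiny stand-in CSV to a temporary file and reads it with `read.csv(..., encoding = "UTF-8")`. The column names (`code`, `name_en`, `name_mm`) and the two rows are hypothetical placeholders, not the actual layout of the MIMU sheet, and the Myanmar text is spelled with `\u` escapes so the script itself stays ASCII.

```r
# Hypothetical stand-in for "State_Region.csv"; the columns and rows are
# illustrative only and do not reproduce the MIMU file.
csv_path <- tempfile(fileext = ".csv")
sample_lines <- c(
  "code,name_en,name_mm",
  "S01,Kachin,\u1000\u1001\u103b\u1004\u103a",
  "S02,Kayah,\u1000\u101a\u102c\u1038"
)

# Write the raw UTF-8 bytes so the test file does not depend on the locale.
writeLines(enc2utf8(sample_lines), csv_path, useBytes = TRUE)

# encoding = "UTF-8" tells R to mark the strings it reads as UTF-8.
regions <- read.csv(csv_path, encoding = "UTF-8", stringsAsFactors = FALSE)

# Inspect the declared encoding of the Myanmar column.
print(Encoding(regions$name_mm))
print(regions$name_en)
```

Whether the Myanmar column then prints legibly in the console still depends on the console font and locale, which is a separate matter from whether the bytes were read correctly.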


In the following R script I tried to replicate what dinari did.


In dinari's question post, cases[[1]] is shown to produce Myanmar-Sar output:


Actually, this is totally different from what my test of his code produced. I also couldn't find any command in his code fragment that could have produced his text output. I guess dinari directed the output to a text file; after all, dinari himself remarked “[the rest is a text file …”. Even then, cases[[1]] alone couldn't produce the text of the corpus, as running my script shows; you need to type cases[[1]][1] to get it. Also, to write a text file from a corpus, we first need to convert it to characters, as shown in my R script.

Furthermore, cases contains 20 corpora, and to write all of them to a text file we need to convert the corpora to a data frame. Thanks to Ken Benoit's answer to the Stack Overflow question “33193152/unable-to-convert-a-corpus-to-data-frame-in-r”, it can be done like this:
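To illustrate the general idea without depending on any text-mining package, here is a rough base-R sketch in which a plain named list of character vectors stands in for the corpus. The names `docs` and `docs_df` are hypothetical, and this is not the code from the answer cited above; with a real corpus object, each document's content would first have to be extracted as characters.

```r
# Hypothetical stand-in for a corpus: a named list, one element per document,
# each element holding that document's lines of text.
docs <- list(
  doc1 = c("line one", "line two"),
  doc2 = "\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c"
)

# Collapse each document into a single string, then bind everything into
# a data frame with one row per document.
docs_df <- data.frame(
  doc_id = names(docs),
  text = vapply(docs, paste, character(1), collapse = "\n"),
  stringsAsFactors = FALSE,
  row.names = NULL
)

str(docs_df)
```

Once the text sits in an ordinary character column like this, it can be written out with the usual file-writing functions.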


Opening the written “caseAll2csv.csv” file with Notepad:




Credit: I found the correct way to write Unicode Myanmar-Sar to a text file from petermeissner's answer to Stack Overflow question 10675360/utf-8-file-output-in-r.
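For readers who want something to try, one base-R pattern for getting UTF-8 text onto disk unchanged is to convert the strings with `enc2utf8()` and write them with `writeLines(..., useBytes = TRUE)`, so that R does not re-encode them into the native Windows code page on the way out. The sketch below assumes this is the kind of approach meant; the variable names and the sample string are mine, for illustration only.

```r
# Sample Myanmar text, written with \u escapes so the script stays ASCII.
txt <- c("\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c", "plain ASCII line")

out_path <- tempfile(fileext = ".txt")

# useBytes = TRUE writes the UTF-8 bytes as they are, without translating
# them to the native encoding first.
writeLines(enc2utf8(txt), out_path, useBytes = TRUE)

# Read the file back, declaring the input as UTF-8.
back <- readLines(out_path, encoding = "UTF-8")

# Compare raw bytes to confirm the round trip left the text untouched.
print(identical(charToRaw(back[1]), charToRaw(enc2utf8(txt[1]))))
```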