Having attempted a mobile-phone data collection application for
parallel vote tabulation (Yan Can Cook or More fun with PVT), I was
hugely frustrated by the problem of displaying Myanmar characters
(Myanglish PVT) on Android mobile phones. So I ranted, “As mentioned
in my last post Myanglish PVT, I was frustrated, so much frustrated in
failing to get a respectable version of PVT data collection
application in Myanmar language.” and posted “PVT tawla or lost in
font jungle”. Luckily, a few months back I received a Nexus 6 by
Motorola from my son, and I was overjoyed to find it displaying the
Myanmar-Sar in my PVT application perfectly. For Myanmar-Sar input I
installed the Bagan Keyboard and the Myanmar Keyboard by Abbott
Cullen, and both worked well.
In an application like PVT, Myanmar-Sar is needed only to assist the
interviewer: for the question wording, the instructions to the
interviewer, and the response categories to choose from. In the data
file they will be represented
by codes (in English) and the only textual data entered in
Myanmar-Sar would be the name of the candidate. This is the way most
survey data files would be organized, that is, most of the entries
would be in precoded categories. However, in certain situations like
data collection for civil registry, or for recording bird and animal
sightings, we may need to enter a lot of data as text in Myanmar-Sar.
How could we work with Myanmar-Sar in situations like the latter,
specifically in R? Searching for the answer on Stack Overflow, which
I guess is the most likely place, I found that it wasn't as easy as I
first thought. After working through a number of questions and
answers on handling non-English characters such as Chinese or
Russian, I stumbled upon a workable approach, and it is the theme of
this post. But first let's look at question 17934847, asked two years
ago by dnari, which still needs an answer.
The question relates specifically to the Mac environment, we were
told, but we can still try imitating it on Windows. Since no link was
provided for the data file “data.csv”, I will use the Myanmar
Information Management Unit (MIMU) place codes for the States/Regions
of Myanmar, available here. The unzipped file is an Excel file with
the .xlsx extension. I opened it in OpenOffice Calc and saved its
“State_Region” sheet as a comma-separated values text file,
“State_Region.csv”, in UTF-8 encoding. Opened with Notepad, the
Myanmar-Sar displayed perfectly.
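
A file saved this way can be read back into R with its Myanmar-Sar
intact. What follows is only a minimal base-R sketch, not the script
used in this post: the column names and the two rows are made-up
stand-ins for the MIMU sheet, and encoding = "UTF-8" is the option I am
assuming suits a file saved as UTF-8.

```r
# Hypothetical stand-in for "State_Region.csv": the column names and the
# two rows are invented for illustration, not the actual MIMU contents.
rows <- c(
  "code,name_en,name_mm",
  "S01,Kachin,\u1000\u1001\u103b\u1004\u103a",
  "S02,Yangon,\u101b\u1014\u103a\u1000\u102f\u1014\u103a"
)

# Write the UTF-8 bytes as they are, without re-encoding to the locale.
path <- tempfile(fileext = ".csv")
writeLines(enc2utf8(rows), path, useBytes = TRUE)

# encoding = "UTF-8" marks the strings that are read in as UTF-8.
sr <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

# Inspect the declared encoding and the byte length of the Myanmar column.
print(Encoding(sr$name_mm))
print(nchar(sr$name_mm, type = "bytes"))
```

Whether the Myanmar-Sar then displays properly in the R console still
depends on the console font and locale, which this sketch does not
address.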
In the following R script I tried to replicate what dnari did. In
dnari's question post, cases[[1]] is shown to produce Myanmar-Sar
output. Actually, that output is totally different from what my test
of dnari's code produced. Also, I couldn't find in the code fragment
any command that could have produced that text output. I guess dnari
directed the output to a text file. After all, dnari remarked “[the
rest is a text file …”.
Even then, cases[[1]] alone could not produce the text of the corpus,
as we saw from running my script; you need to type cases[[1]][1] to
get it. Also, to write out a text file from a corpus, we first need
to convert it into characters, as shown in my R script.
Furthermore, cases contains 20 corpora, and to write all of them to a
text file we need to convert the corpora to a data frame. Thanks to
Ken Benoit's answer to the Stack Overflow question
“33193152/unable-to-convert-a-corpus-to-data-frame-in-r”, this can be
done.
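
The idea of turning a collection of documents into a data frame can be
sketched in base R as below. This is not the code from Ken Benoit's
answer: a plain list with a content element stands in for the corpus
(an assumption made only for illustration), and the names case_df and
doc_id are hypothetical.

```r
# Stand-in for a corpus: a plain list of documents, each holding its text
# in a `content` element. This structure is assumed for illustration only.
cases <- list(
  list(content = "\u1000\u1001\u103b\u1004\u103a"),
  list(content = "\u101b\u1014\u103a\u1000\u102f\u1014\u103a")
)

# Pull the text out of every document as a character vector ...
texts <- vapply(cases, function(doc) as.character(doc$content), character(1))

# ... and bind it into a data frame, one row per document.
case_df <- data.frame(
  doc_id = seq_along(texts),
  text = texts,
  stringsAsFactors = FALSE
)
str(case_df)
```

With a real corpus object the accessor for the text would differ, but
the shape of the result, one row per document with the text in a
character column, is what makes it easy to write out afterwards.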
Opening the written “caseAll2csv.csv” file with Notepad:
Credit: I found the correct way to write Unicode Myanmar-Sar to a
text file in petermeissner's answer to Stack Overflow question
10675360/utf-8-file-output-in-r.
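
For readers who want to try writing Myanmar-Sar out themselves, here
is a minimal base-R sketch of one way to do it. It is my assumption of
a workable approach rather than a copy of the answer credited above:
the text is converted to UTF-8 with enc2utf8() and written with
useBytes = TRUE so that R does not re-encode it to the native locale on
the way out. The sample strings and variable names are illustrative.

```r
# Two sample Myanmar-Sar strings, given as Unicode escapes.
myanmar_text <- c(
  "\u1000\u1001\u103b\u1004\u103a",
  "\u101b\u1014\u103a\u1000\u102f\u1014\u103a"
)

# Write the UTF-8 bytes unchanged; useBytes = TRUE suppresses re-encoding.
out_path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(myanmar_text), out_path, useBytes = TRUE)

# Read the file back, declaring the lines to be UTF-8.
read_back <- readLines(out_path, encoding = "UTF-8")

# Compare what was read back with what was written.
print(read_back == myanmar_text)
```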