I've got this piece of text that
SwanHtet1992 has worked on. I know just a bit of R programming. Our
children Htike, Mu, and Chan support this old couple's modest needs,
so I can pretend that I have all the time in the world to
spare. Not a perfect excuse, but a workable one.
First I looked at what the quanteda
R-package could do. I thought it logical to break up the text into
its elements, the characters, before starting work. So I tried using
quanteda's tokens( ) function to break the text into
characters. My first impression was that the results looked
strange, with some characters consisting of a combination of
multiple Unicode code points while others were single code points
only. My initial idea was to split the text into single Unicode
characters (code points) and then to combine them or leave them alone
as appropriate to form syllables. So I looked for a tool to do just
that and found that the function strsplit( ) from
R's base package would do the first part. This is a sample of
the result:
 [1] "ပ" "ိ" "ဿ" "ာ" "ခ" "ျ" "ိ" "န" "်" "၁" "၀" "သ" "ာ" "း" "ရ" "ှ" "ိ" "သ" "ေ"
[20] "ာ" "က" "ြ" "က" "်" "သ" "ာ" "း" "မ" "ျ" "ာ" "း" "ခ" "ျ" "က" "်" "ပ" "ြ" "ု"
[39] "တ" "်" "က" "ျ" "ွ" "ေ" "း" "မ" "ွ" "ေ" "း" "လ" "ှ" "ူ" "ဒ" "ါ" "န" "်" "း"
[58] "သ" "ွ" "ာ" "း" "သ" "ည" "်" "့" "အ" "တ" "ွ" "က" "်" "က" "ျ" "ေ" "း" "ဇ" "ူ"
[77] "း" "တ" "င" "်" "ပ" "ါ" "သ" "ည" "်" "။"
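Here is a minimal sketch of this splitting step. It is an illustration only, not the exact code behind the output above: the short sample string and the variable names are assumptions, and the text is written with \u escapes so that it does not depend on the file encoding.

```r
# The \u escapes spell out the last words of the sample text:
# ကျေးဇူးတင်ပါသည်
x <- "\u1000\u103b\u1031\u1038\u1007\u1030\u1038\u1010\u1004\u103a\u1015\u102b\u101e\u100a\u103a"

# An empty split pattern splits between every character (code point)
code_points <- strsplit(x, "")[[1]]
print(code_points)

# utf8ToInt() gives the numeric code points, shown here in U+XXXX form
print(sprintf("U+%04X", utf8ToInt(x)))
```

Each element of code_points is a single code point, so the dependent signs come out separated from their consonants, just as in the sample above.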
Then I tried forming syllables by
combining those Unicode code points (characters). But that was too
hard. The idea seemed quite simple, though: I would just need to add
other symbols such as "ု", "ှ", "်", "း", "ြ", and "ိ"
to the consonants "က" to "အ".
After a bit of fumbling, I arrived at a workable idea
of using what is known as regular expression matching where
each of the given characters is inspected and appropriate action
taken to form syllables. The function I finally found to be useful
was the grepl( ) function.
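To give a feel for how grepl( ) can inspect each character, here is a small sketch. The pattern and the variable names are my assumptions for illustration; it relies on the consonants "က" to "အ" occupying U+1000 to U+1021 in the Myanmar block.

```r
# Code points of the syllable ကျေး, written with \u escapes
cp <- c("\u1000", "\u103b", "\u1031", "\u1038")

# Assumed pattern: one Myanmar consonant, U+1000 (က) to U+1021 (အ)
consonant_pattern <- "^[\u1000-\u1021]$"

# grepl() returns TRUE for each element that matches the pattern
is_consonant <- grepl(consonant_pattern, cp, perl = TRUE)
print(is_consonant)

# Glue the consonant and the signs that follow it into one syllable,
# stopping if a second consonant turns up
syllable <- ""
for (ch in cp) {
  if (grepl(consonant_pattern, ch, perl = TRUE) && nzchar(syllable)) {
    break
  }
  syllable <- paste0(syllable, ch)
}
print(syllable)
```

Only the first element is a consonant here, so all four code points end up in one syllable. Real text needs more rules than this, for example for a consonant followed by "်".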
By luck, I happened to look at
quanteda's tokens( ) output again.
 [1] "ပိ" "ဿ" "ာ" "ချိ" "န်" "၁" "၀" "သ" "ာ" "း" "ရှိ" "သေ"
[13] "ာ" "ကြ" "က်" "သ" "ာ" "း" "မျ" "ာ" "း" "ချ" "က်" "ပြု"
[25] "တ်" "ကျွေ" "း" "မွေ" "း" "လှူ" "ဒ" "ါ" "န်" "း" "သွ" "ာ"
[37] "း" "သ" "ည့်" "အ" "တွ" "က်" "ကျေ" "း" "ဇူ" "း" "တ" "င်"
[49] "ပ" "ါ" "သ" "ည်" "။"
Then I realized that Unicode code
points that are placed around, above, or below a consonant are
combined with that consonant and treated as a single character by
tokens( ). This makes the process of combining characters or code
points to form a syllable simpler and shorter. For example, when you
see a character "ာ" or "း" or "်" or "ါ",
you will need to combine this character
with the previous one. The following picture of the Myanmar Unicode code
block from Wikipedia helped me in devising such “rules” (the line
markings are mine; ignore them or I'll let you decipher them):
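The combining rule described above can be sketched roughly as follows. This is a simplified illustration, not the code from my earlier post: the tokens are typed in by hand to mimic the end of the tokens( ) output, and the rule (join a token to the previous one if it starts with "ာ", "း", or "ါ", or contains "်") is an assumed reading that covers only this sample.

```r
# Tokens as they appear at the end of the tokens() output
# (ကျေ း ဇူ း တ င် ပ ါ သ ည်), typed in by hand with \u escapes
toks <- c("\u1000\u103b\u1031", "\u1038", "\u1007\u1030", "\u1038",
          "\u1010", "\u1004\u103a", "\u1015", "\u102b",
          "\u101e", "\u100a\u103a")

# Assumed rule: a token joins the previous one if it starts with
# U+102C, U+1038 or U+102B, or contains the asat U+103A
joins_previous <- grepl("^[\u102c\u1038\u102b]|\u103a", toks, perl = TRUE)

syllables <- character(0)
for (i in seq_along(toks)) {
  if (joins_previous[i] && length(syllables) > 0) {
    # Append this token to the syllable built so far
    last <- length(syllables)
    syllables[last] <- paste0(syllables[last], toks[i])
  } else {
    # Otherwise this token starts a new syllable
    syllables <- c(syllables, toks[i])
  }
}
print(syllables)
```

With these ten tokens the loop produces five syllables: ကျေး, ဇူး, တင်, ပါ, and သည်.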
For programming in R, you'll need to
write "\u1000" if you want to say "က".
Interestingly, not everyone knows how to do that. I myself got it
(and other programming difficulties) right only after
a lot of searching on Stack Overflow and
elsewhere.
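A quick sketch of the \u escape in base R, using only utf8ToInt( ), intToUtf8( ), and nchar( ); the variable name is just for illustration.

```r
ka <- "\u1000"                    # the consonant က
print(ka)

print(utf8ToInt(ka))              # 4096, which is hexadecimal 1000
print(intToUtf8(0x1000) == ka)    # TRUE: the same character built from its number
print(nchar(ka))                  # 1: a single character
```

The hexadecimal digits after \u are the same ones shown in the Unicode code chart, so any character in the chart can be typed this way.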
One special character is the virama
"\u1039". It is used for indicating stacked consonants. When
you see this character, you know that the character before the virama
is to be displayed on the same line as the other characters and the
character after the virama
is to be stacked below it. Thus
"\u101e\u1019\u1039\u1019", if correctly printed out, will be
"သမ္မ".
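As a small sketch of how the virama can be located in a string (the variable names are assumptions for illustration):

```r
stacked <- "\u101e\u1019\u1039\u1019"   # သမ္မ
cp <- strsplit(stacked, "")[[1]]

# Position of the virama (U+1039) among the code points
v <- which(cp == "\u1039")
print(length(cp))    # 4 code points
print(v)             # 3

upper <- cp[v - 1]   # the consonant written on the line
lower <- cp[v + 1]   # the consonant stacked below it
print(c(upper, lower))
```

Here both the upper and the lower consonant are "မ". A segmenter has to keep all three (consonant, virama, consonant) inside the same syllable.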
Now, if I have got the ideas right, the real
task is to implement them. In R programming, the gurus disapprove of using
loops, but at present loops are the only tool I know how to use. So I
set up two nested loops to produce two nested lists. That
worked, as you can see in my last post.
In swanhtet1992's examples, English or
Myanmar numerals and English words were segmented not as syllables,
but as words. I feel that is the sensible approach, because in the
first place we are doing syllable segmentation as a means of
extracting units of meaning (or word equivalents) from
Myanmar language text. Following this line of thought, I think it would be
more appropriate to extract "သမ္မတ"
rather than "သမ္မ" and "တ".
But this will need more work, and perhaps it can be handled when we
proceed to refine our syllables/words for further NLP work.
Also, when I tried handling Myanmar
text, English text, and their numerals at the same time, the coding became
quite complicated. So I did it in two passes. The first pass handles
Myanmar text including numerals, and the second pass handles English
text including numerals.
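The two-pass idea can be sketched like this. It only shows how numerals and English words might be kept whole in each pass; the patterns, the sample string, and the variable names are assumptions, and the Myanmar syllables themselves are left out.

```r
# Mixed sample: "R 2024", then သည်, then the Myanmar numeral ၁၀
x <- "R 2024 \u101e\u100a\u103a \u1041\u1040"

# Pass 1: runs of Myanmar digits (U+1040 to U+1049) are kept whole
m1 <- gregexpr("[\u1040-\u1049]+", x, perl = TRUE)
myanmar_numbers <- regmatches(x, m1)[[1]]
print(myanmar_numbers)

# Pass 2: runs of ASCII letters or digits are kept whole as words
m2 <- gregexpr("[A-Za-z0-9]+", x, perl = TRUE)
english_words <- regmatches(x, m2)[[1]]
print(english_words)
```

Keeping the two patterns separate means each pass stays simple, at the cost of going over the text twice.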
Having got the syllable segmentation working
with the initial step of tokenizing characters with the quanteda
package, I guess the same code will also work with the initial step of
splitting the text into Unicode code points with strsplit( ).
I think the only thing you need to do is modify the
parameters of the grepl( ) function as appropriate. I haven't
tried that. Would you like to test it?
As always, my philosophy is: if Bayanathi can grope, so can you (and maybe a lot better). Also, it is easier
done than said!