Thursday, January 31, 2019

Groping for syllable segmentation


I've got this piece of text that SwanHtet1992 has worked on. I know just a bit of R programming. Our children Htike, Mu, and Chan have supported this old couple's modest needs, and so I could pretend that I have all the time in the world to spare. Not perfect, but a workable excuse.

First I looked at what the quanteda R package could do. I thought it would be logical to break the text up into its elements, the characters, to start working. So I tried using quanteda's tokens( ) function to break the text into characters. My first impression of the results was that they looked strange: some characters consisted of a combination of multiple Unicode code points while others were single code points only. My initial idea was to split the text into single Unicode characters (code points) and then to combine them, or leave them alone, as appropriate to form syllables. So I looked for a tool to do just that and found that the strsplit( ) function from the R base package would do the first part. This is a sample of the result:
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[20] "" "က" "" "က" "" "" "" "" "" "" "" "" "" "" "က" "" "" "" ""
[39] "" "" "က" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[58] "" "" "" "" "" "" "" "" "" "" "" "က" "" "က" "" "" "" "" ""
[77] "" "" "" "" "" "" "" "" "" ""
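For anyone who wants to reproduce that first step, here is a minimal sketch of it; the short string is just a placeholder for the text I was actually working with.

## Split a UTF-8 string into single Unicode code points.
## The string below ("Myanmar") is only a placeholder example.
txt <- "\u1019\u103c\u1014\u103a\u1019\u102c"

## strsplit( ) with an empty split pattern returns one element per code point.
chars <- strsplit(txt, "")[[1]]
print(chars)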

Then I tried forming syllables by combining those Unicode code points (characters), but that was too hard. The idea seemed quite simple, though: I would just need to attach the dependent signs (the vowel signs and other marks written around a consonant) to base consonants such as "က". After a bit of fumbling, I arrived at a workable idea of using what is known as regular expression matching, where each of the given characters is inspected and the appropriate action taken to form syllables. The function I finally found to be useful was grepl( ).
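To give a rough idea of the kind of test I mean, here is a sketch of such a check; the character range U+102B to U+103E (dependent vowel signs, various signs, and medial consonants) is my own rough reading of the Unicode chart, not the final rule set.

## Is this single character one that depends on the consonant before it?
## The range U+102B-U+103E is a rough choice, for illustration only.
is_dependent <- function(ch) grepl("[\u102b-\u103e]", ch)

is_dependent("\u102c")   # TRUE:  a dependent vowel sign
is_dependent("\u1000")   # FALSE: the consonant "\u1000" stands on its own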

By luck, I happened to look at quanteda's tokens( ) output again.
[1] "ပိ" "" "" "ချိ" "န်" "" "" "" "" "" "ရှိ" "သေ"
[13] "" "ကြ" "က်" "" "" "" "မျ" "" "" "ချ" "က်" "ပြု"
[25] "တ်" "ကျွေ" "" "မွေ" "" "လှူ" "" "" "န်" "" "သွ" ""
[37] "" "" "ည့်" "" "တွ" "က်" "ကျေ" "" "ဇူ" "" "" "င်"
[49] "" "" "" "ည်" ""
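If you want to reproduce that kind of output, the call looks roughly like this; again the short string is only a placeholder.

library(quanteda)

## Character-level tokenization; the string is just a placeholder example.
txt <- "\u1019\u103c\u1014\u103a\u1019\u102c"
toks <- tokens(txt, what = "character")
as.character(toks)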

Then I realized that Unicode code points that are placed around, above, or below a consonant are combined with that consonant and treated as a single character by tokens( ). This makes the process of combining characters or code points to form a syllable simpler and shorter. For example, when you see a dependent character such as a vowel sign, you need to combine it with the previous one. The following picture of the Myanmar Unicode code block from Wikipedia helped me in devising such “rules” (the line markings are mine; ignore them, or I'll let you decipher them):


For programming in R, you'll need to write “\u1000” if you want to say "က". It is interesting to find that not everyone knows how to do this; for myself, I got it (and other programming difficulties) right only after a lot of searching on Stack Overflow and elsewhere. One special character is the virama “\u1039”, which is used for indicating stacked consonants. When you see this character, you know that the character before the virama is to be displayed on the same line as the other characters, and the character after the virama is to be stacked below the first character. Thus “\u101e\u1019\u1039\u1019”, if correctly printed out, will be "သမ္မ".
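A couple of small examples of the escape notation:

## "\u1000" is the R escape for the code point U+1000, the consonant shown above.
cat("\u1000", "\n")

## The virama U+1039 stacks the following consonant under the preceding one,
## so these four code points print as the stacked pair from the text.
cat("\u101e\u1019\u1039\u1019", "\n")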

Now, if I have got the ideas right, the real task is to implement them. In R programming, gurus disapprove of using loops, but at present loops are the only tool I know how to use. So I set up two nested loops to produce two nested lists. That worked, as you could see in my last post.
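My actual code is in that post; purely as a sketch of the shape of those loops (the helper name and the mark range here are my own rough choices, and the single combine-backwards rule is much cruder than the real rules), it is something like this:

## Sketch only: glue each dependent mark onto the syllable before it.
segment_syllables <- function(txt) {
  chars <- strsplit(txt, "")[[1]]
  syllables <- character(0)
  for (ch in chars) {
    if (length(syllables) > 0 && grepl("[\u102b-\u103e]", ch)) {
      ## dependent sign: combine with the previous syllable
      syllables[length(syllables)] <- paste0(syllables[length(syllables)], ch)
    } else {
      ## anything else: start a new syllable
      syllables <- c(syllables, ch)
    }
  }
  syllables
}

## The outer loop over several texts produces the nested list I mentioned.
texts <- c("\u1019\u103c\u1014\u103a\u1019\u102c")   # placeholder input
result <- list()
for (i in seq_along(texts)) {
  result[[i]] <- segment_syllables(texts[i])
}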

In swanhtet1992's examples, English or Myanmar numerals and English words were segmented not as syllables, but as words. I felt that is the sensible approach, because in the first place we are doing syllable segmentation as a means of extracting elements of meaning (or word equivalents) from Myanmar-language text. With this line of thought, I think it would be more appropriate to extract "သမ္မတ" rather than "သမ္မ" and "တ". But this will need more work, and perhaps it can be handled when we proceed to refine our syllables/words for further NLP work.

Also, when I tried handling Myanmar text, English text, and their numerals at the same time, the coding became quite complicated. So I did it in two passes: the first handles Myanmar text, including Myanmar numerals, and the second handles English text, including numerals.
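For the second pass, the idea is simply that runs of Latin letters or digits stay together as single tokens. A crude sketch of that idea (the helper name is my own):

## Second pass (sketch): keep runs of ASCII letters and digits whole,
## leaving everything else for the Myanmar pass.
extract_english_tokens <- function(txt) {
  m <- gregexpr("[A-Za-z0-9]+", txt)
  regmatches(txt, m)[[1]]
}

extract_english_tokens("quanteda 2019 test")
## [1] "quanteda" "2019"     "test"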

After getting the syllable segmentation working with the initial step of tokenizing characters with the quanteda package, I guess the same code will work with the initial step of splitting the text into Unicode code points with strsplit( ). I think the only thing you would need to do is modify the parameters of the grepl( ) calls as appropriate. I haven't tried that. You may like to test it?
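If you would like to try that swap, the two starting points look something like this (untested, as I said; the string is only a placeholder):

txt <- "\u1019\u103c\u1014\u103a\u1019\u102c"   # placeholder

## Starting point A: quanteda keeps the dependent marks attached to consonants.
chars_a <- as.character(quanteda::tokens(txt, what = "character"))

## Starting point B: plain single code points, one element each.
chars_b <- strsplit(txt, "")[[1]]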

As always, my philosophy is: if Bayanathi can grope, so can you (and maybe a lot better). Also, it is easier done than said!
