Thursday, January 31, 2019

Groping for syllable segmentation


I've got this piece of text that SwanHtet1992 has worked on. I know just a bit of R programming. Our children Htike, Mu, and Chan support this old couple's modest needs, so I can pretend that I have all the time in the world to spare. Not a perfect excuse, but a workable one.

First I looked at what the quanteda R package could do. I thought it logical to start by breaking the text up into its elements, the characters, so I tried quanteda's tokens( ) function for that. My first impression was that the results looked strange: some characters consisted of a combination of multiple Unicode code points while others were single code points only. My initial idea was to split the text into single Unicode characters (code points) and then to combine them or leave them alone as appropriate to form syllables. So I looked for a tool to do just that and found that the function strsplit( ) from base R would do the first part. This is a sample of the result:
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[20] "" "က" "" "က" "" "" "" "" "" "" "" "" "" "" "က" "" "" "" ""
[39] "" "" "က" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[58] "" "" "" "" "" "" "" "" "" "" "" "က" "" "က" "" "" "" "" ""
[77] "" "" "" "" "" "" "" "" "" ""
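
A minimal sketch of that first step, using only base R. The sample word is an illustration of mine and is not taken from the original text; utf8ToInt( ) is used here only to show which code point sits behind each piece.

```r
# Illustrative word (an assumption, not from the original text): "ကြောင်း"
x <- "\u1000\u103c\u1031\u102c\u1004\u103a\u1038"

# Splitting on the empty string yields one element per Unicode code point
chars <- strsplit(x, "")[[1]]

# Show the code point of each element in U+XXXX notation
code_points <- sprintf("U+%04X", utf8ToInt(x))

print(length(chars))   # number of code points in the word
print(code_points)
```

What looks like one written syllable is stored as seven separate code points, which is why the pieces have to be recombined afterwards.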

Then I tried forming syllables by combining those Unicode code points (characters). But that was too hard. The idea seemed quite simple, though: I would just need to add other symbols such as "" "" "" "" "" "" to the consonants "က" to "". After a bit of fumbling, I arrived at a workable idea of using what is known as regular expression matching where each of the given characters is inspected and appropriate action taken to form syllables. The function I finally found to be useful was the grepl( ) function.
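
To give a feel for how grepl( ) can be used here, this is a small sketch that tests three single code points against a character class. The range U+102B to U+103E is an assumption read off the Unicode chart for illustration; it is wider than the class used in the actual script.

```r
# Three single code points: consonant "က", vowel sign "ာ", and asat "်"
pieces <- c("\u1000", "\u102c", "\u103a")

# Character class covering the dependent signs U+102B to U+103E
# (illustrative range, not the one used in the script)
is_dependent <- grepl("[\u102b-\u103e]", pieces)

# TRUE marks a piece that should be attached to the piece before it
print(is_dependent)
```

grepl( ) is vectorised, so it returns one TRUE or FALSE for every element it is given.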

By luck, I happened to look at quanteda's tokens( ) output again.
[1] "ပိ" "" "" "ချိ" "န်" "" "" "" "" "" "ရှိ" "သေ"
[13] "" "ကြ" "က်" "" "" "" "မျ" "" "" "ချ" "က်" "ပြု"
[25] "တ်" "ကျွေ" "" "မွေ" "" "လှူ" "" "" "န်" "" "သွ" ""
[37] "" "" "ည့်" "" "တွ" "က်" "ကျေ" "" "ဇူ" "" "" "င်"
[49] "" "" "" "ည်" ""

Then I realized that Unicode code points that are placed around, above, or below a consonant are combined with that consonant and treated as a single character by tokens( ). This makes the process of combining characters or code points to form a syllable simpler and shorter. For example, when you see a character "" or "" or "" or "", you will need to combine this character with the previous one. The following picture of the Myanmar Unicode code block from Wikipedia helped me in devising such “rules” (the line markings are mine; ignore them or I'll let you decipher them):


For programming in R, you'll need to write “\u1000” if you want to say "က". Interestingly, not everyone knows how to do that, and I myself got it (and other programming difficulties) right only after a lot of searching on Stack Overflow and elsewhere. One special character is the virama “\u1039”. This is used for indicating stacked consonants. When you see this character you know that the character before the virama is to be displayed on the same line as the other characters and the character after the virama is to be stacked below that first character. Thus “\u101e\u1019\u1039\u1019”, if correctly printed out, will be "သမ္မ".
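
A small sketch of the escape notation and the virama, using only base R. The variable names are mine and purely illustrative.

```r
# "သမ္မ" written with \u escapes: sa, ma, virama, ma
stacked <- "\u101e\u1019\u1039\u1019"

# Count the code points and look for the virama U+1039
n_code_points <- nchar(stacked, type = "chars")
has_virama    <- grepl("\u1039", stacked, fixed = TRUE)
virama_pos    <- which(utf8ToInt(stacked) == 0x1039)

print(n_code_points)
print(has_virama)
print(virama_pos)
```

The position of the virama tells you which neighbours belong together: the code point just before it stays on the line and the one just after it is stacked below.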

Now if I got the ideas right, the real task would be to implement them. In R programming, the gurus disapprove of using loops, but at present loops are the only tool I know how to use. So I set up two nested loops to produce two nested lists. That worked, as you can see in my last post.
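
For readers who would rather avoid loops, the joining rules of the first pass can probably be expressed with vectorised grepl( ) and cumsum( ). This is only a sketch under the assumption that the tokens look like quanteda's character tokens; merge_tokens is a hypothetical helper name, not part of any package, and I have only checked it on small examples like the ones shown.

```r
# Hypothetical helper: join character tokens into syllables without a loop
merge_tokens <- function(tok) {
  n <- length(tok)
  if (n == 0) return(character(0))
  prev <- c("", tok[-n])                       # token just before each token

  # same character classes as in the looped script
  dependent    <- grepl("[\u102b\u102c\u1038\u1039\u103a]", tok)
  is_digit     <- grepl("[\u1040-\u1049]", tok)
  prev_digit   <- grepl("[\u1040-\u1049]", prev)
  after_virama <- grepl("\u1039", prev, fixed = TRUE)

  # TRUE means "paste this token onto the one before it"
  attach <- dependent |
    (is_digit & prev_digit) |
    (!is_digit & !prev_digit & after_virama)
  attach[1] <- FALSE

  # every FALSE starts a new group; paste each group together
  group <- cumsum(!attach)
  unname(vapply(split(tok, group), paste0, character(1), collapse = ""))
}

tokens_1 <- c("\u1010\u1031", "\u102c", "\u1004\u103a",
              "\u1000\u102d\u102f", "\u101b\u102e", "\u1038")
tokens_2 <- c("\u1042", "\u1040", "\u1041", "\u1048", "\u1001\u102f")

print(merge_tokens(tokens_1))
print(merge_tokens(tokens_2))
```

The trick is that cumsum( ) over "starts a new syllable" gives every token a group number, so the pasting can be done per group instead of per step of a loop.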

In swanhtet1992's examples, English or Myanmar numerals and English words were segmented not as syllables but as words. I feel that is the sensible approach because, in the first place, we are doing syllable segmentation as a means of extracting the elements of meaning (or word equivalents) from Myanmar language text. Following this line of thought, I think it would be more appropriate to extract "သမ္မတ" rather than "သမ္မ" and "တ". But this will need more work, and perhaps it can be handled when we proceed to refine our syllables/words for further NLP work.

Also, when I tried handling Myanmar text, English text, and their numerals at the same time, the coding became quite complicated. So I did it in two passes. The first pass handles Myanmar text including numerals and the second pass handles English text including numerals.
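
The second pass boils down to gluing together neighbouring tokens that are English letters, digits, or a dash. Below is a loop-free sketch of that idea; join_ascii_runs is a hypothetical helper name and the sample tokens are illustrative.

```r
# Hypothetical helper: join runs of English letters, digits and dashes
join_ascii_runs <- function(tok) {
  n <- length(tok)
  if (n == 0) return(character(0))

  # TRUE for tokens made up only of A-Z, a-z, 0-9 or "-"
  is_en   <- grepl("^[0-9A-Za-z-]+$", tok)
  prev_en <- c(FALSE, is_en[-n])

  # attach an English token to the previous one if that was English too
  attach <- is_en & prev_en
  group  <- cumsum(!attach)
  unname(vapply(split(tok, group), paste0, character(1), collapse = ""))
}

first_pass <- c("\u1000", "P", "o", "s", "c", "o", "\u101b", "A", "D", "-", "7")
print(join_ascii_runs(first_pass))
```

Myanmar syllables coming out of the first pass are left untouched because they never match the English character class.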

Having got the syllable segmentation to work with the initial step of tokenizing characters with the quanteda package, I guess the same code would also work if the initial step were to split the text into Unicode code points with strsplit( ). I think the only thing you would need to do is modify the parameters of the grepl( ) function as appropriate. I haven't tried that. Would you like to test it?
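
Here is one way the strsplit( ) route might look, as an untested-in-anger sketch rather than a finished solution. It assumes a wider grepl( ) class (U+102B to U+103E) and, in this sketch, one extra look-ahead: a consonant that is immediately followed by an asat or a virama is attached to the previous syllable. segment_code_points is a hypothetical name, and numerals and English text are not handled here.

```r
# Hypothetical helper: syllables from single code points (Myanmar letters only)
segment_code_points <- function(text) {
  cp <- strsplit(text, "")[[1]]
  n  <- length(cp)
  if (n == 0) return(character(0))
  prev <- c("", cp[-n])                        # code point before
  nxt  <- c(cp[-1], "")                        # code point after

  dependent    <- grepl("[\u102b-\u103e]", cp) # signs that never start a syllable
  is_consonant <- grepl("[\u1000-\u1021]", cp)
  # a consonant followed by asat or virama closes the previous syllable
  closes_prev  <- is_consonant & grepl("[\u1039\u103a]", nxt)
  after_virama <- grepl("\u1039", prev, fixed = TRUE)

  attach <- dependent | closes_prev | after_virama
  attach[1] <- FALSE
  group <- cumsum(!attach)
  unname(vapply(split(cp, group), paste0, character(1), collapse = ""))
}

print(segment_code_points("\u1010\u1031\u102c\u1004\u103a\u1000\u102d\u102f\u101b\u102e\u1038"))
print(segment_code_points("\u101e\u1019\u1039\u1019\u1010"))
```

The look-ahead is needed because, at code point level, the final consonant of a syllable arrives before its asat, so looking at the current code point alone cannot tell it apart from the start of a new syllable.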

As always, my philosophy is: If Bayanathi can grope, so can you (and maybe a lot better). Also, it is easier done than said!

Wednesday, January 30, 2019

Citizen coding for syllable segmentation of Myanmar text


Usually I would say “Playing with syllable segmentation …” or “A dummy’s codes for syllable segmentation …”. Anyway.

Some time ago I was unhappy about the current NLP software being word oriented. I wrote earlier: “But the NLP software, as I know, are presently based on English and English like languages where word is the element for the communication of meaning. Unfortunately our language has no equivalence for this.” Then I chanced to read these words of Professor Hla Pe, “a foremost Burmese language linguist” (from “Burma: Literature, Historiography, Scholarship, Language, Life, and Buddhism”):

Now I have much more than a glimmer of hope. If I could arrange syllables appropriately, combine them meaningfully, and separate them by blanks, I would get text that is structurally the same as text built from words. Then I would be able to use the widespread “word” oriented NLP software. It is that simple (or naive?). Now I’m optimistic. The initial tasks seem clear. First I’ll need to look at how others are tackling this syllable segmentation task for text in the Myanmar language. At the same time I'll grab any collection of Myanmar text in digital form while looking for promising software in the R environment.

My first task of looking for syllable segmentation papers resulted in a fairly quick and clean harvest. I've collected a respectable number of them. But ignorant of Myanmar-sar and linguistics as I am, I would need time to make sense of them, if I could manage that at all! Among them, one particular presentation of syllable segmentation by Swanhtet1992 on GitHub (Burmese (Myanmar) syllable level segmentation with regex) is interesting because it contains the original texts as well as the segmented results. Inspired by that, I set out to independently produce the same results using some NLP package(s) of R.

Swanhtet1992’s original text and segmentation results:
This is the summary of my approach: (i) copied and pasted all five pieces of original text into Notepad and saved them to a file with utf8 encoding, (ii) read the data into R, (iii) tokenized the characters of the text with the quanteda package, and (iv) looped through the data, inspecting the resulting characters with the grepl( ) function and combining them or leaving them alone as appropriate to form syllables.

To handle the English characters including numerals as Swanhtet1992 did, I ran the tokenized data through two passes. The first one handled all the Myanmar characters. The second one handled the English characters. This post has been created with R Notebook in RStudio. The R script that follows could be run with the standard R GUI, but as I noted in one of my earlier posts, its console can’t display some utf8 characters including those for Myanmar Unicode. However, you could save the results to a text file with utf8 encoding so that you can see the characters correctly later when you open that file with Notepad, for example.
# found it too complex to incorporate Eng alphabets, dash, 
# and Eng/Myan numerals in single pass.
# (Pass-1): with Myan stacked consonants and numerals
# (Pass-2): with Eng alphabets, dash and numerals
# Jan 29, 2019
x <- readLines(con = "swanhtet1992_ReSegment.txt", encoding = "UTF-8")
# using quanteda package
library(quanteda)
xit <- tokens(x, what = "character")
library(utf8)
utf8_print(xit[[3]])

 [1] "တေ"  "ာ"   "င်"   "ကို"   "ရီ"   "း"   "ယ"   "ာ"   "း"   "အ"   "ခြေ" "စို"   "က်"   "P"  
[15] "o"   "s"   "c"   "o"   "D"   "a"   "e"   "w"   "o"   "o"   "နှ"   "င့်"   "သြ"  "စ"  
[29] "တြေ" "း"   "လျ"  "အ"   "ခြေ" "စို"   "က်"   "W"   "o"   "o"   "d"   "s"   "i"   "d"  
[43] "e"   "တို့"   "အ"   "ကျို"  "း"   "တူ"   "ပူ"   "း"   "ပေ"  "ါ"   "င်"   "း"   "ဆေ"  "ာ"  
[57] "င်"   "ရွ"   "က်"   "နေ"  "သ"   "ည့်"   "ရ"   "ခို"   "င်"   "က"   "မ်"   "း"   "လွ"   "န်"  
[71] "ရှိ"   "A"   "D"   "-"   "7"

utf8_print(xit[[4]])

[1] "၂"  "၀"  "၁"  "၈"  "ခု"  "နှ"  "စ်"  "အ"  "ာ"  "ရှ"  "အ"  "ာ"  "း"  "က"  "စ"  "ာ"  "း" 
[18] "ပြို" "င်"  "ပွဲ"  "တွ"  "င်"  "အ"  "ာ"  "း"  "က"  "စ"  "ာ"  "း"  "န"  "ည်"  "း"  "အ"  "ရေ"
[35] "အ"  "တွ"  "က်"  "တို"  "း"  "မြ" "င့်"  "လ"  "ာ"  "ခဲ့" 
# ----------First pass: Myanmar characters -----------------
syll <- list()
for (k in 1:5) {
    syll.k <- list()
    TEMP <- xit[[k]][1]
    j <- 1
    L <- length(xit[[k]])
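    for(i in 2:L) {
        # signs that attach to the previous token: U+102B/U+102C (aa),
        # U+1038 (visarga), U+1039 (virama), U+103A (asat)
        y <- grepl("[\u102b-\u102c\u1038-\u1039\u103a]",xit[[k]][i])
        if (y == TRUE){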
            TEMP <- paste0(TEMP,xit[[k]][i])
        } else { 
            # Myanmar digits U+1040 to U+1049: current (my.1) and previous (my.0) token
            my.1 <- grepl("[\u1040-\u1049]", xit[[k]][i])
            my.0 <- grepl("[\u1040-\u1049]", xit[[k]][i-1])
            if (my.1 == TRUE){
                if (my.0 == TRUE){
                    TEMP <- paste0(TEMP,xit[[k]][i])
                } else {
                    syll.k[[j]] <- TEMP
                    j <- j+1
                    TEMP <- xit[[k]][i]
                }
            } else {
                if (my.0 == TRUE){
                    syll.k[[j]] <- TEMP
                    j <- j+1
                    TEMP <- xit[[k]][i]
                } else {
                    # for stacked consonant
                    if (grepl("[\u1039]",xit[[k]][i-1])==TRUE){
                        TEMP <- paste0(TEMP,xit[[k]][i])
                    } else {
                        syll.k[[j]] <- TEMP
                        j <- j+1
                        TEMP <- xit[[k]][i]
                    }
                }
            }
        }
    }
    if (i == L){
        syll.k[[j]] <- TEMP
    }
    syll[[k]] <- paste(unlist(syll.k))
}
# print first pass samples 3 & 4
utf8_print(syll[[3]])
[1] "တောင်"  "ကို"     "ရီး"    "ယား"   "အ"     "ခြေ"   "စိုက်"    "P"     "o"     "s"    
[11] "c"     "o"     "D"     "a"     "e"     "w"     "o"     "o"     "နှင့်"    "သြ"   
[21] "စ"     "တြေး"  "လျ"    "အ"     "ခြေ"   "စိုက်"    "W"     "o"     "o"     "d"    
[31] "s"     "i"     "d"     "e"     "တို့"     "အ"     "ကျိုး"   "တူ"     "ပူး"    "ပေါင်း"
[41] "ဆောင်"  "ရွက်"    "နေ"    "သည့်"    "ရ"     "ခိုင်"    "ကမ်း"   "လွန်"    "ရှိ"     "A"    
[51] "D"     "-"     "7"  
utf8_print(syll[[4]])
 [1] "၂၀၁၈" "ခု"    "နှစ်"   "အာ"   "ရှ"    "အား"  "က"    "စား"  "ပြိုင်"  "ပွဲ"    "တွင်"   "အား" 
[13] "က"    "စား"  "နည်း"  "အ"    "ရေ"   "အ"    "တွက်"   "တိုး"   "မြင့်"  "လာ"   "ခဲ့"  
# ------ Second pass: Eng characters including numerals ---
SYLL <- list()
#SYLL.k <- list()
for (k in 1:5) {
    SYLL.k <- list()
    TEMP <- syll[[k]][1]
    j <- 1
    L <- length(syll[[k]])
    for(i in 2:L) {
        en.1 <- grepl("[\u002d\u0030-\u0039\u0041-\u005a\u0061-\u007a]", syll[[k]][i])
        en.0 <- grepl("[\u002d\u0030-\u0039\u0041-\u005a\u0061-\u007a]", syll[[k]][i-1])
        if (en.1 == TRUE){
            if (en.0 == TRUE){
                TEMP <- paste0(TEMP,syll[[k]][i])
            } else {
                SYLL.k[[j]] <- TEMP
                j <- j+1
                TEMP <- syll[[k]][i]
            }
        # current char not English
        } else {
            SYLL.k[[j]] <- TEMP
            j <- j+1
            TEMP <- syll[[k]][i]
        }
    }
    if (i == L){
        SYLL.k[[j]] <- TEMP
    }
    SYLL[[k]] <- paste(unlist(SYLL.k))
}
Write the syllable segmentation results to a text file: “SYLL.txt”.
zz <- file("SYLL.txt", "w")
for (m in 1:5) {
    writeLines(paste0("'",SYLL[[m]],"'", collapse =  " "),con = zz, useBytes = TRUE)
}
close(zz)
When you open that text file with Notepad (by double-clicking) you can see one problem:


However, the output on the RStudio console shows it correctly.
syll_text <- readLines(con = "SYLL.txt", encoding = "UTF-8")
for (m in 1:5) {
  cat(syll_text[m],"\n","\n")
}
Our results reproduce the same syllables as SwanHtet1992's results (except for blanks in the Myanmar text), with one exception. For the fourth piece of text, “PoscoDaewoo” is produced without the space in between that it should have. This is because the blank character was dropped when quanteda tokenized the characters, and I couldn't find a way to correct that. Perhaps you could.
I noticed that the blanks that appeared in SwanHtet1992's original Myanmar “Text” also appeared in SwanHtet1992's “Result”. Since we are looking for syllables, I think they may be ignored.
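
On the dropped blank between “Posco” and “Daewoo”: one possible workaround, sketched here with base R only, is to split with strsplit( ), which keeps the blank as an element of its own, join the English runs, and discard the blank afterwards. The variable names are illustrative. quanteda's tokens( ) also documents a remove_separators argument that may be worth trying, but that is not tested here.

```r
x <- "Posco Daewoo"

# strsplit( ) keeps the blank as its own element
chars <- strsplit(x, "")[[1]]

# join neighbouring English letters, digits and dashes; the blank breaks the run
is_en   <- grepl("^[0-9A-Za-z-]$", chars)
prev_en <- c(FALSE, is_en[-length(is_en)])
group   <- cumsum(!(is_en & prev_en))
runs    <- unname(vapply(split(chars, group), paste0, character(1), collapse = ""))

# drop the blank once it has done its job as a separator
words <- runs[runs != " "]
print(words)
```

Because the blank is still present when the runs are formed, the two English words stay apart.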