Wednesday, January 30, 2019

Citizen coding for syllable segmentation of Myanmar text


Usually I would say “Playing with syllable segmentation …” or “A dummy’s codes for syllable segmentation …”. Anyway.

Some time ago I was unhappy about current NLP software being word oriented. I wrote earlier: “But the NLP software, as I know, are presently based on English and English like languages where word is the element for the communication of meaning. Unfortunately our language has no equivalence for this.” Then I chanced to read these words of Professor Hla Pe, “a foremost Burmese language linguist” (from “Burma: Literature, Historiography, Scholarship, Language, Life, and Buddhism”):

Now I have much more than a glimmer of hope. If I could arrange syllables appropriately, combine them meaningfully and separate them by blanks, I would get text that is structurally the same as text built from words. Then I would be able to use the widespread “word” oriented NLP software. It is that simple (or naive?). Now I’m optimistic. The initial tasks seem clear. First I’ll need to look at how others are tackling syllable segmentation for text in the Myanmar language. At the same time I’ll grab any collection of Myanmar text in digital form while looking for promising software in the R environment.

My first task of looking for syllable segmentation papers resulted in a fairly quick and clean harvest. I've collected a respectable number of them. But ignorant of Myanmar-sar and linguistics as I am, I will need time to make sense of them, if I can manage that at all! Among them, one particular presentation of syllable segmentation by Swanhtet1992 on GitHub (Burmese (Myanmar) syllable level segmentation with regex) is interesting because it contains the original texts as well as the segmented results. Inspired by that, I set out to independently reproduce the same results using some NLP package(s) of R.

Swanhtet1992’s original text and segmentation results:
This is a summary of my approach: (i) copy and paste all five pieces of original text into Notepad and save them to a file with UTF-8 encoding, (ii) read the data into R, (iii) tokenize the text into characters with the quanteda package, and (iv) loop through the resulting characters, inspecting each one with the grepl( ) function and either combining it with its neighbour or leaving it alone, as appropriate, to form syllables.

To handle the English characters, including numerals, as Swanhtet1992 did, I ran the tokenized data through two passes. The first handled all the Myanmar characters. The second handled the English characters. This post was created with R Notebook in RStudio. The R script that follows could be run with the standard R GUI, but as I noted in one of my earlier posts, its console can’t display some UTF-8 characters, including those for Myanmar Unicode. However, you could save the results to a text file with UTF-8 encoding so that you can see the characters correctly later when you open that file with Notepad, for example.
# I found it too complex to handle English letters, the dash,
# and English/Myanmar numerals in a single pass.
# (Pass-1): Myanmar stacked consonants and numerals
# (Pass-2): English letters, dash and numerals
# Jan 29, 2019
x <- readLines(con = "swanhtet1992_ReSegment.txt", encoding = "UTF-8")
# using quanteda package
library(quanteda)
xit <- tokens(x, what = "character")
library(utf8)
utf8_print(xit[[3]])

 [1] "တေ"  "ာ"   "င်"   "ကို"   "ရီ"   "း"   "ယ"   "ာ"   "း"   "အ"   "ခြေ" "စို"   "က်"   "P"  
[15] "o"   "s"   "c"   "o"   "D"   "a"   "e"   "w"   "o"   "o"   "နှ"   "င့်"   "သြ"  "စ"  
[29] "တြေ" "း"   "လျ"  "အ"   "ခြေ" "စို"   "က်"   "W"   "o"   "o"   "d"   "s"   "i"   "d"  
[43] "e"   "တို့"   "အ"   "ကျို"  "း"   "တူ"   "ပူ"   "း"   "ပေ"  "ါ"   "င်"   "း"   "ဆေ"  "ာ"  
[57] "င်"   "ရွ"   "က်"   "နေ"  "သ"   "ည့်"   "ရ"   "ခို"   "င်"   "က"   "မ်"   "း"   "လွ"   "န်"  
[71] "ရှိ"   "A"   "D"   "-"   "7"

utf8_print(xit[[4]])

[1] "၂"  "၀"  "၁"  "၈"  "ခု"  "နှ"  "စ်"  "အ"  "ာ"  "ရှ"  "အ"  "ာ"  "း"  "က"  "စ"  "ာ"  "း" 
[18] "ပြို" "င်"  "ပွဲ"  "တွ"  "င်"  "အ"  "ာ"  "း"  "က"  "စ"  "ာ"  "း"  "န"  "ည်"  "း"  "အ"  "ရေ"
[35] "အ"  "တွ"  "က်"  "တို"  "း"  "မြ" "င့်"  "လ"  "ာ"  "ခဲ့" 
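Each token printed above can hold more than one code point, which is why the first pass below can test a whole token with grepl( ). Here is a minimal sketch for looking inside a token, using only base R. The helper name code_points and the variable dependent are my own illustrations, not part of quanteda or utf8, and the two sample tokens assume the usual Unicode encoding of those syllable parts.

```r
# Hypothetical helper: list the Unicode code points held in one token.
code_points <- function(token) sprintf("U+%04X", utf8ToInt(token))

# The character class used in the first pass: signs that attach a token
# to the syllable before it (U+102B, U+102C, U+1038, U+1039, U+103A).
dependent <- "[\u102b-\u102c\u1038-\u1039\u103a]"

nga_asat <- "\u1004\u103a"        # consonant nga followed by asat
ka_i_u   <- "\u1000\u102d\u102f"  # consonant ka with two vowel signs

code_points(nga_asat)     # two code points in a single token
grepl(dependent, nga_asat)  # TRUE, the token contains U+103A
grepl(dependent, ka_i_u)    # FALSE, so this token starts a new syllable
```

So a token is attached to the previous syllable as soon as any one of its code points falls in that class.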
# ----------First pass: Myanmar characters -----------------
syll <- list()
for (k in 1:5) {
    syll.k <- list()
    TEMP <- xit[[k]][1]
    j <- 1
    L <- length(xit[[k]])
    for(i in 2:L) {
        y <- grepl("[\u102b-\u102c\u1038-\u1039\u103a]",xit[[k]][i])
        if (y == TRUE){
            TEMP <- paste0(TEMP,xit[[k]][i])
        } else { 
            my.1 <- grepl("[\u1040-\u1049]", xit[[k]][i])
            my.0 <- grepl("[\u1040-\u1049]", xit[[k]][i-1])
            if (my.1 == TRUE){
                if (my.0 == TRUE){
                    TEMP <- paste0(TEMP,xit[[k]][i])
                } else {
                    syll.k[[j]] <- TEMP
                    j <- j+1
                    TEMP <- xit[[k]][i]
                }
            } else {
                if (my.0 == TRUE){
                    syll.k[[j]] <- TEMP
                    j <- j+1
                    TEMP <- xit[[k]][i]
                } else {
                    # for stacked consonant
                    if (grepl("[\u1039]",xit[[k]][i-1])==TRUE){
                        TEMP <- paste0(TEMP,xit[[k]][i])
                    } else {
                        syll.k[[j]] <- TEMP
                        j <- j+1
                        TEMP <- xit[[k]][i]
                    }
                }
            }
        }
    }
    if (i == L){
        syll.k[[j]] <- TEMP
    }
    syll[[k]] <- paste(unlist(syll.k))
}
# print first pass samples 3 & 4
utf8_print(syll[[3]])
[1] "တောင်"  "ကို"     "ရီး"    "ယား"   "အ"     "ခြေ"   "စိုက်"    "P"     "o"     "s"    
[11] "c"     "o"     "D"     "a"     "e"     "w"     "o"     "o"     "နှင့်"    "သြ"   
[21] "စ"     "တြေး"  "လျ"    "အ"     "ခြေ"   "စိုက်"    "W"     "o"     "o"     "d"    
[31] "s"     "i"     "d"     "e"     "တို့"     "အ"     "ကျိုး"   "တူ"     "ပူး"    "ပေါင်း"
[41] "ဆောင်"  "ရွက်"    "နေ"    "သည့်"    "ရ"     "ခိုင်"    "ကမ်း"   "လွန်"    "ရှိ"     "A"    
[51] "D"     "-"     "7"  
utf8_print(syll[[4]])
 [1] "၂၀၁၈" "ခု"    "နှစ်"   "အာ"   "ရှ"    "အား"  "က"    "စား"  "ပြိုင်"  "ပွဲ"    "တွင်"   "အား" 
[13] "က"    "စား"  "နည်း"  "အ"    "ရေ"   "အ"    "တွက်"   "တိုး"   "မြင့်"  "လာ"   "ခဲ့"  
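The first-pass loop above can also be written without explicit indexing. What follows is only a sketch under my reading of the rules in that loop: a token is glued to the previous one if it contains one of the dependent signs, if it continues a run of Myanmar digits, or if it follows a token holding the stacking sign U+1039. The function name combine_myanmar is hypothetical, and it takes any character vector of tokens, not only quanteda output.

```r
# Hypothetical vectorised version of the first pass (base R only).
combine_myanmar <- function(tok) {
  n <- length(tok)
  if (n < 2) return(tok)
  dep   <- grepl("[\u102b-\u102c\u1038-\u1039\u103a]", tok)  # attaches to previous
  digit <- grepl("[\u1040-\u1049]", tok)                     # Myanmar numerals
  stack <- grepl("\u1039", tok)                              # stacking sign
  prev_digit <- c(FALSE, digit[-n])
  prev_stack <- c(FALSE, stack[-n])
  # TRUE where the token joins the syllable that is already open
  joins <- dep | (digit & prev_digit) | (!digit & !prev_digit & prev_stack)
  joins[1] <- FALSE
  # every FALSE opens a new group; paste the tokens of each group together
  group <- cumsum(!joins)
  unname(vapply(split(tok, group), paste0, character(1), collapse = ""))
}

# The first nine tokens of the fourth sample, written with escapes
tok <- c("\u1042", "\u1040", "\u1041", "\u1048", "\u1001\u102f",
         "\u1014\u103e", "\u1005\u103a", "\u1021", "\u102c")
combine_myanmar(tok)
```

For these nine tokens the function returns four elements: the year as one numeral group followed by three syllables, which agrees with the start of the printed result for the fourth sample.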
# ------ Second pass: Eng characters including numerals ---
SYLL <- list()
#SYLL.k <- list()
for (k in 1:5) {
    SYLL.k <- list()
    TEMP <- syll[[k]][1]
    j <- 1
    L <- length(syll[[k]])
    for(i in 2:L) {
        en.1 <- grepl("[\u002d\u0030-\u0039\u0041-\u005a\u0061-\u007a]", syll[[k]][i])
        en.0 <- grepl("[\u002d\u0030-\u0039\u0041-\u005a\u0061-\u007a]", syll[[k]][i-1])
        if (en.1 == TRUE){
            if (en.0 == TRUE){
                TEMP <- paste0(TEMP,syll[[k]][i])
            } else {
                SYLL.k[[j]] <- TEMP
                j <- j+1
                TEMP <- syll[[k]][i]
            }
        # current char not English
        } else {
            SYLL.k[[j]] <- TEMP
            j <- j+1
            TEMP <- syll[[k]][i]
        }
    }
    if (i == L){
        SYLL.k[[j]] <- TEMP
    }
    SYLL[[k]] <- paste(unlist(SYLL.k))
}
Write the syllable segmentation results to a text file: “SYLL.txt”.
zz <- file("SYLL.txt", "w")
for (m in 1:5) {
    writeLines(paste0("'",SYLL[[m]],"'", collapse =  " "),con = zz, useBytes = TRUE)
}
close(zz)
When you open that text file with Notepad (by double-clicking) you can see one problem:


However, the output on the RStudio console shows it correctly.
syll_text <- readLines(con = "SYLL.txt", encoding = "UTF-8")
for (m in 1:5) {
  cat(syll_text[m],"\n","\n")
}
Our results reproduce the same syllables as SwanHtet1992's (apart from blanks in the Myanmar text) except in one place. For the third piece of text, “PoscoDaewoo” is produced without the space in between that it should have. This is because the blank character was dropped when “quanteda” tokenized the text into characters, and I couldn't find a way to correct that. Perhaps you could.
I noticed that the blanks that appear in SwanHtet1992's original Myanmar “Text” also appear in SwanHtet1992's “Result”. Since we are looking for syllables, I think they may be ignored.
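One possible way around the lost blank, offered only as a sketch: split each line on blanks first, and then join English runs inside each piece, so that nothing can be joined across a space. Both function names below are hypothetical, and strsplit(chunk, "") is just a base R stand-in for quanteda's character tokens (it splits by code point, so Myanmar pieces would still need the first pass). It may also be worth trying the remove_separators argument of quanteda's tokens( ), though that is untested here.

```r
# Hypothetical helper: join runs of English letters, digits and "-"
# within one vector of character tokens (the second-pass rule).
join_ascii_runs <- function(tok) {
  n <- length(tok)
  if (n < 2) return(tok)
  en <- grepl("^[-0-9A-Za-z]+$", tok, perl = TRUE)
  joins <- en & c(FALSE, en[-n])   # English token right after an English token
  unname(vapply(split(tok, cumsum(!joins)), paste0, character(1), collapse = ""))
}

# Split on blanks first, then treat each piece separately.
segment_with_spaces <- function(x) {
  chunks <- strsplit(x, "[[:space:]]+")[[1]]
  chunks <- chunks[nzchar(chunks)]   # drop the empty piece from a leading blank
  unlist(lapply(chunks, function(chunk) {
    join_ascii_runs(strsplit(chunk, "")[[1]])  # stand-in for quanteda tokens
  }))
}

segment_with_spaces("Posco Daewoo AD-7")
```

With this input the two names come back as separate elements, followed by "AD-7" as a third.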
