Thursday, February 7, 2019

Courage and Evil WORDS

Earlier, I had looked at some of the papers on word segmentation of Myanmar text and felt that it would be a long way off for me. Now, suddenly, I have a bag of Myanmar syllables of my own, for better or worse, so I looked at the resources on word segmentation of Myanmar text again. A Google search gave me strange-looking terms like: syllable-level longest matching; syllable segmentation and syllable merging; a rule-based heuristic approach … for syllable segmentation, and a dictionary-based statistical approach for syllable merging; Foma-generated finite state automata; conditional random fields; … .
To see what has been done in this area, I found that Saini (June 2016) provides a “First Classified Annotated Bibliography of NLP Tasks in the Burmese Language”, which describes nine approaches “On Word Identification, Segmentation, Disambiguation, Collation, Semantic Parsing and Tokenization for Burmese Language”.

Hmm, let me earmark all of the above, and more, for the future. For now, I will shuffle on the way a dummy would when identifying words in an unfamiliar language: look up the given words in a dictionary or a word list. The catch with Myanmar text is that words don’t come separated by spaces. The good news is that I’ve already segmented the text into syllables, and a “word” would just be a syllable or a combination of syllables, depending on the text. I guess I could do it like this:
(i) Take the first syllable in the text and look for a word in the word list/dictionary that starts with that syllable.
(ii-a) If there is at least one match, move to the next syllable.
(iii-a) Append the current syllable to the previous syllable(s) and look for a word in the word list/dictionary that starts with this merged syllable cluster.
(ii-b) If there is no match, take the current syllable as a word (and mark it with “*” to show it is not in the dictionary).
(iii-b) Move to the next syllable and look for a word in the word list/dictionary that starts with this syllable.
Repeat the (-a) or (-b) steps as appropriate until all syllables are processed.
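
To see how these steps behave before running them on real text, here is a minimal toy sketch in R. The function name `segment_syllables`, the helper `has_prefix`, and the Latin-letter "syllables" and dictionary are all made up for illustration. It assumes literal prefix matching with `startsWith()` rather than a regular expression, and unlike the full script it never emits an empty string at the end of a text.

```r
# Toy sketch of steps (i)-(iii); names and data here are hypothetical.
segment_syllables <- function(syll, dict) {
  # TRUE if at least one dictionary entry starts with the literal string s
  has_prefix <- function(s) any(startsWith(dict, s))
  words <- character(0)
  pending <- ""                          # syllable cluster built so far
  for (s in syll) {
    if (pending != "" && has_prefix(paste0(pending, s))) {
      pending <- paste0(pending, s)      # (iii-a) merged cluster still matches
      next
    }
    if (pending != "") words <- c(words, pending)  # close the previous cluster
    if (has_prefix(s)) {
      pending <- s                       # (ii-a) start a new cluster
    } else {
      words <- c(words, paste0("*", s))  # (ii-b) syllable not in the dictionary
      pending <- ""
    }
  }
  if (pending != "") words <- c(words, pending)    # flush what is left
  words
}

toy_dict <- c("ab", "abc", "d")
toy_syll <- c("a", "b", "c", "x", "d", "a")
segment_syllables(toy_syll, toy_dict)
# "abc" "*x" "d" "a"
```

Note that the final "a" is reported as a word even though it is only the beginning of a dictionary entry. That is a property of the prefix-matching steps themselves, not of this sketch.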

The following is my R script for word segmentation:
# Kanaung WORD LIST downloaded Feb06, 2019 from raw at 
# https://raw.githubusercontent.com/kanaung/wordlists/master/wordlists.list
# saved file: wordlists.list.txt
# knWL_1 <- readLines(con = "wordlists.list.txt", encoding = "UTF-8")
# save(knWL_1,file = "knWL_12449.RData")
# Wiktionary dict downloaded Feb06, 2019 from raw at
# https://raw.githubusercontent.com/kanaung/wordlists/master/wikitionary/
##  mywiktionary20150901pagesarticlesmultistream.xml.out.list.sorted.only-mm.txt
# wikitionary_1 <- readLines(con = "mywiktionary20150901.txt", encoding = "UTF-8")
# save(wikitionary_1,file = "wikitionary_1.RData")
load("SYLL.RData")
load("knWL_12449.RData")
load("wikitionary_1.RData")
Run word segmentation; matching with Kanaung Word List (12,467 words)
word <- list()
dict <- knWL_1
for (k in 1:5){
    word.k <- list()
    L <- length(SYLL[[k]])
    j <- 1
    x <- grep(paste0("^",SYLL[[k]][1]),dict)
    if (length(x)==0){
        word.k[[j]] <- paste0("*",SYLL[[k]][1])
        j <- j+1
        TEMP.0 <- ""
    } else {
        TEMP.0 <- SYLL[[k]][1]
    }
    for (i in 2:L){
        x <- grep(paste0("^",SYLL[[k]][i]),dict)
        if (TEMP.0==""){
            # current syllable has no match in dict
            if (length(x) == 0){
                word.k[[j]] <- paste0("*",SYLL[[k]][i])
                j <- j + 1
                TEMP.0 <- ""
            # current syllable has at least one match in dict
            }else{
                TEMP.0 <- SYLL[[k]][i]
            }
        # previous syllable cluster not empty
        } else {
            TEMP.1 <- paste0(TEMP.0,SYLL[[k]][i])
            y <- grep(paste0("^",TEMP.1),dict)
            # previous syll cluster+current, no match in dict
            if (length(y) == 0){
                word.k[[j]] <- TEMP.0
                j <- j + 1
                if (length(x) == 0){
                    word.k[[j]] <- paste0("*",SYLL[[k]][i])
                    j <- j + 1
                    TEMP.0 <- ""
                } else {
                    TEMP.0 <- SYLL[[k]][i]
                }
            } else {
                TEMP.0 <- paste0(TEMP.0,SYLL[[k]][i])
            }
        }
    }
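    # after the last syllable, flush the pending cluster;
    # it is "" when the last syllable had no match, hence the "" entries in the output
    if (i == L){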
        word.k[[j]] <- TEMP.0
    }
    word[[k]] <- paste(unlist(word.k))
}
Show the results of word segmentation using Kanaung Word List
utf8::utf8_print(unlist(word))
  [1] "ဝန်ကြီးချုပ်"     "ဦး"           "ဖြိုး"          "မင်း"          "သိန်း"         
  [6] "*ခွင့်"          "ထပ်"           "တိုင်"           "ရန်ကုန်တိုင်း"      "လွှတ်တော်"       
 [11] "မှာ"           "*YBS"         "စနစ်"          "ပြုပြင်"        "ပြောင်းလဲ"     
 [16] "ဖို့"            "တင်"           "*သွင်း"         "တဲ့"            "အ"           
 [21] "ဆို"            "အတည်"          "ပြု"           "ဖို့"            "*၊"          
 [26] "မ"            "ပြု"           "ဖို့"            "ဆုံးဖြတ်"        "*မယ့်"         
 [31] "မနက်ဖြန်"       "လွှတ်တော်"        "အစည်းအဝေး"     "ကို"            "လည်း"         
 [36] "ဝန်ကြီးချုပ်"     "ဦး"           "ဖြိုး"          "မင်း"          "သိန်း"         
 [41] "က"            "*ခွင့်"          "ထပ်"           "တိုင်"           "ခဲ့"           
 [46] "ကြောင်း"       "သိ"            "ရ"            "ပါ"           "တယ်"          
 [51] "စီ"            "အိုင်"           "အေ"           "က"            "နှိပ်စက်"        
 [56] "စစ်"           "ဆေး"          "မှု"            "တွေ"           "လုပ်"          
 [61] "ခဲ့"            "အမေ"          "ရိ"            "ကန်"           "*-"          
 [66] "ဗဟို"           "ထောက်လှမ်း"      "ရေး"          "ဌာန"          "*CIA"        
 [71] "ဟာ"           "သမ္မတ"         "ဟောင်း"        "*ဂျော့ချ်"      "*ဘုရှ်"         
 [76] "လက်"           "ထက်"           "စက်တင်ဘာ"       "*၁၁"          "ရက်"          
 [81] "တိုက်"           "ခိုက်"           "ခံရ"           "မှု"            "နောက်"        
 [86] "ပိုင်း"          "စစ်"           "ဆေး"          "မှု"            "တွေ"          
 [91] "လုပ်"           "ရာ"           "မှာ"           "နှိပ်စက်"         "ညှင်း"         
 [96] "ပန်း"          "မှု"            "တွေ"           "ကျူးလွန်"        "ခဲ့"           
[101] "ဖူး"           "တယ်"           "*လို့"           "စီ"            "အိုင်"          
[106] "အေ"           "*ရဲ့"           "အ"            "ကြီး"          "အကဲ"          
[111] "ဟောင်း"        "*ဘတ်ဇ်"         "ခ"            "*ရောရှ့်"        "*ဂတ်"         
[116] "ကဘီ"           "ဘီ"            "စီ"            "ကို"            "ပြော"        
[121] "ခဲ့"            "ပါ"           "တယ်"           "*။"           ""            
[126] "တောင်"         "ကို"            "*ရီး"          "ယား"          "အခြေ"        
[131] "စိုက်"           "*PoscoDaewoo" "*နှင့်"          "*သြ"          "စ"           
[136] "*တြေး"        "လျ"           "အခြေ"         "စိုက်"           "*Woodside"   
[141] "တို့"            "အကျိုး"         "တူ"            "ပူး"           "ပေါင်း"       
[146] "ဆောင်ရွက်"       "နေ"           "*သည့်"          "ရခိုင်"          "ကမ်းလွန်"       
[151] "ရှိ"            "*AD-7"        ""             "*၂၀၁၈"        "ခုနှစ်"         
[156] "အာ"           "ရှ"            "အားက"         "စား"          "ပြိုင်"         
[161] "ပွဲ"            "တွင်"           "အားက"         "စား"          "နည်း"         
[166] "အရေ"          "အ"            "တွက်"           "တိုး"           "*မြင့်"        
[171] "လာ"           "ခဲ့"            "ပိ"            "*ဿာ"          "*ချိန်"        
[176] "*၁၀"          "သား"          "ရှိ"            "သော"          "ကြက်"         
[181] "သား"          "များ"         "*ချက်"         "ပြုတ်"          "ကျွေးမွေး"     
[186] "လှူဒါန်း"        "သွား"          "*သည့်"          "အ"            "တွက်"          
[191] "ကျေးဇူးတင်"     "ပါ"           "သည်"           "*။"           ""            
Run word segmentation; matching with words in Wiktionary (26,729 words)
The code is exactly the same as above, with only dict changed to wikitionary_1 and the result stored in wordWK.

Show the results of word segmentation using Wiktionary data
utf8::utf8_print(unlist(wordWK))
  [1] "ဝန်ကြီးချုပ်"     "ဦး"           "ဖြိုး"          "မင်း"          "သိန်း"         
  [6] "ခွင့်"           "ထပ်"           "တိုင်"           "ရန်"           "ကုန်"          
 [11] "တိုင်း"          "လွှတ်တော်"        "မှာ"           "*YBS"         "စနစ်"         
 [16] "ပြုပြင်"        "ပြောင်းလဲ"      "ဖို့"            "တင်သွင်း"        "တဲ့"           
 [21] "အဆို"           "အတည်ပြု"        "ဖို့"            "*၊"           "မ"           
 [26] "ပြု"           "ဖို့"            "ဆုံးဖြတ်"        "*မယ့်"          "မနက်ဖြန်"      
 [31] "လွှတ်တော်"        "အစည်းအဝေး"     "ကို"            "လည်း"          "ဝန်ကြီးချုပ်"    
 [36] "ဦး"           "ဖြိုး"          "မင်း"          "သိန်း"          "က"           
 [41] "ခွင့်"           "ထပ်"           "တိုင်"           "ခဲ့"            "ကြောင်း"      
 [46] "သိရ"           "ပါ"           "တယ်"           "စီ"            "အိုင်"          
 [51] "အေ"           "က"            "နှိပ်စက်"         "စစ်ဆေး"        "မှု"           
 [56] "တွေ"           "လုပ်"           "ခဲ့"            "အမေ"          "ရိ"           
 [61] "ကန်"           "*-"           "ဗဟို"           "ထောက်လှမ်းရေး"   "ဌာန"         
 [66] "*CIA"         "ဟာ"           "သမ္မတ"         "ဟောင်း"        "*ဂျော့ချ်"     
 [71] "*ဘုရှ်"          "လက်ထက်"         "စက်တင်ဘာ"       "*၁၁"          "ရက်"          
 [76] "တိုက်ခိုက်"         "ခံရ"           "မှု"            "နောက်ပိုင်း"      "စစ်ဆေး"       
 [81] "မှု"            "တွေ"           "လုပ်"           "ရာ"           "မှာ"          
 [86] "နှိပ်စက်"         "ညှင်း"          "ပန်း"          "မှု"            "တွေ"          
 [91] "ကျူးလွန်"        "ခဲ့"            "ဖူး"           "တယ်"           "လို့"           
 [96] "စီ"            "အိုင်"           "အေ"           "ရဲ့"            "အကြီးအကဲ"      
[101] "ဟောင်း"        "*ဘတ်ဇ်"         "ခ"            "*ရောရှ့်"        "*ဂတ်"         
[106] "ကဘီ"           "ဘီစီ"           "ကို"            "ပြော"         "ခဲ့"           
[111] "ပါ"           "တယ်"           "*။"           ""             "တောင်"        
[116] "ကို"            "*ရီး"          "ယား"          "အခြေစိုက်"       "*PoscoDaewoo"
[121] "နှင့်"           "သြ"           "စ"            "*တြေး"        "လျ"          
[126] "အခြေစိုက်"       "*Woodside"    "တို့"            "အကျိုး"         "တူ"           
[131] "ပူးပေါင်း"      "ဆောင်ရွက်"       "နေ"           "သည့်"           "ရခိုင်"         
[136] "ကမ်းလွန်"        "ရှိ"            "*AD-7"        ""             "*၂၀၁၈"       
[141] "ခုနှစ်"          "အာရှ"          "အားကစား"      "ပြိုင်ပွဲ"         "တွင်"          
[146] "အားကစား"      "နည်း"          "အရေ"          "အတွက်"          "တိုး"          
[151] "မြင့်"          "လာ"           "ခဲ့"            "ပိ"            "*ဿာ"         
[156] "ချိန်"          "*၁၀"          "သား"          "ရှိ"            "သော"         
[161] "ကြက်သား"       "များ"         "ချက်ပြုတ်"       "ကျွေးမွေး"      "လှူဒါန်း"       
[166] "သွား"          "သည့်အတွက်"        "ကျေးဇူးတင်"     "ပါ"           "သည်"          
[171] "*။"           ""           
The above exercise tried to find out whether naive coding can work for word segmentation of Myanmar text. I guess the answer could be, timidly, yes. It also seems obvious that the approach used here hinges entirely on the quality and completeness of the word lists. I have no knowledge of the field, and at the beginning I felt that a word list of only ten or twenty thousand entries would be quite inadequate. Yet their contribution to word boundary identification seems quite respectable, and you can clearly see the improvement in word identification from the larger Wiktionary list over the smaller Kanaung list: for example, “တင်သွင်း”, “စစ်ဆေး” and “အားကစား” each come out as a single word with the Wiktionary list but are split into pieces with the Kanaung list. In my word segmentation I’ve marked with a “*” any base syllable (first syllable) that does not begin any entry in the given word list. This may be quite helpful in understanding and locating the “incompleteness” of a given word list.
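
One way to put the “*” marks to work is to count them. The sketch below is only an illustration: `unknown_share` and `unknown_tokens` are hypothetical helper names, and the toy vector stands in for a real result such as `unlist(word)` or `unlist(wordWK)`.

```r
# Hypothetical helpers for summarising the "*" marks in a segmentation result.

# Share of non-empty tokens that are flagged with a leading "*"
unknown_share <- function(words) {
  words <- words[words != ""]            # ignore the empty end-of-text entries
  mean(startsWith(words, "*"))
}

# Flagged tokens with the "*" removed, most frequent first
unknown_tokens <- function(words) {
  flagged <- words[startsWith(words, "*")]
  sort(table(substring(flagged, 2)), decreasing = TRUE)
}

toy <- c("ab", "*YBS", "cd", "*CIA", "*YBS", "")
unknown_share(toy)     # 3 of the 5 non-empty tokens are flagged
unknown_tokens(toy)    # YBS appears twice, CIA once
```

Running both helpers on the results from the two word lists would give a rough, like-for-like measure of how much each list leaves uncovered, and the frequency table points at the entries worth adding first.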
Remarks on the procedural aspects of the word segmentation itself: I did not follow the usual practice of removing stop words, punctuation, blanks, and other less meaningful textual data from the original syllable data. I guess that makes it a conservative syllable list, and the words produced from it in turn form a conservative word list. They are ugly, but this way they may be quite helpful for making adjustments and improvements.
