Sunday, January 7, 2024

Casual analysis of Lindemann’s Burmese Wikipedia corpus - 1


Lindemann’s post about his corpus project gives the length of the Burmese corpus as 75,527 words. Strictly speaking, these are phrases, not words: in Burmese text, spaces separate phrases rather than individual words.

His corpus is a single, very long string of text. To facilitate analysis, I segment it on whitespace into phrases, which gives a vector of 75,527 phrases.

# read Lindemann's corpus
Lindemann_doc <- readLines("Burmese.txt")
library(quanteda)
Lindemann_doc_phrase1 <- char_segment(Lindemann_doc, "\\s", valuetype = "regex", remove_pattern = FALSE)
table(nzchar(Lindemann_doc_phrase1))

 TRUE 
75527 
Lindemann_doc_phrase1[c(1,75527)]
[1] "အီလက်ထရွန်းနစ်ဆိုသည်မှာ" "ရည်မှန်းထားပါသည်"  

I split the phrases into syllables.

library(stringr)
# Syllabify each phrase: put a space before every consonant, independent vowel
# and a few special signs, then re-attach to the preceding syllable any consonant
# that is immediately followed by dot-below, asat or visarga (a final consonant),
# as well as stacked consonants (virama), kinzi and the letter ဿ.
# Finally drop Latin letters, digits, Burmese digits and punctuation, and tidy spaces.
Lindemann_doc_phrase1_syl <- str_replace_all(Lindemann_doc_phrase1, "([က-အဣ-ဧဩဪဿ၌-၏])", "-\\1") %>% 
  str_replace_all(., "-", " ") %>% 
  str_replace_all(., "\\s([က-အ][့်း]\\s|[က-အ][့်း])", "\\1") %>% 
  str_replace_all(., "\\s([က-အ]္)\\s", "\\1")  %>% 
  str_replace_all(., "(\\s[က-အ]င်္)\\s", "\\1")  %>% 
  str_replace_all(., "\\s(ဿ)", "\\1") %>%
  str_remove_all(., "[a-zA-Z0-9၀-၉၊။\\[\\]\\(\\)]|^\\s") %>% 
  str_squish(.)

View first and last phrases segmented into syllables.

length(Lindemann_doc_phrase1_syl)
[1] 75527
Lindemann_doc_phrase1_syl[c(1,75527)]
[1] "အီ လက် ထ ရွန်း နစ် ဆို သည် မှာ" "ရည် မှန်း ထား ပါ သည်"     

Find the distinct syllables in Lindemann’s corpus

I (i) tokenize the syllable-segmented text, using the space as the token boundary, and (ii) use the dfm() function to get the matrix of counts of distinct syllables by phrase.

# tokenize
Lindemann_doc_phrase1_syl_tok <- tokens(Lindemann_doc_phrase1_syl, what = "fastestword")

# find total number of syllables
sum(ntoken(Lindemann_doc_phrase1_syl_tok))
[1] 297070
# view syllables in the last phrase
Lindemann_doc_phrase1_syl_tok[75527]
Tokens consisting of 1 document.
text75527 :
[1] "ရည်"  "မှန်း" "ထား" "ပါ"  "သည်" 
# create dfm (document feature matrix)
dfmat_Lindeman_1 <- dfm(Lindemann_doc_phrase1_syl_tok)
dfmat_Lindeman_1
Document-feature matrix of: 75,527 documents, 2,722 features (99.86% sparse) and 0 docvars.
       features
docs    အီ လက် ထ ရွန်း နစ် ဆို သည် မှာ ရွန် များ
  text1 1  1 1   1  1 1  1  1  0    0
  text2 1  1 1   0  0 0  0  0  1    1
  text3 0  0 0   0  0 0  0  0  0    0
  text4 0  0 0   0  0 0  0  0  0    0
  text5 0  0 0   0  0 0  0  0  0    0
  text6 0  0 0   0  0 0  0  0  0    0
[ reached max_ndoc ... 75,521 more documents, reached max_nfeat ... 2,712 more features ]

There are 2,722 distinct syllables in total in Lindemann’s Burmese corpus.
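
As a cross-check on this count, the number of features in the dfm can be compared with a count obtained in base R. The following is only a minimal sketch on a toy vector made of the two segmented phrases shown earlier; `syl_demo` and `dfmat_demo` are names introduced here for illustration, and on the full data the same calls would be applied to Lindemann_doc_phrase1_syl and dfmat_Lindeman_1.

```r
library(quanteda)

# Toy input: the first and last phrases of the corpus, already segmented into syllables
syl_demo <- c("အီ လက် ထ ရွန်း နစ် ဆို သည် မှာ", "ရည် မှန်း ထား ပါ သည်")

# quanteda route: tokenize on spaces, build the dfm, count its features
dfmat_demo <- dfm(tokens(syl_demo, what = "fastestword"))
n_quanteda <- nfeat(dfmat_demo)

# base R route: split on spaces and count the unique syllables
n_base <- length(unique(unlist(strsplit(syl_demo, " ", fixed = TRUE))))

c(quanteda = n_quanteda, base = n_base)
```

The two numbers should agree. If they did not on the full corpus, that would suggest the tokenizer is splitting on something other than the single spaces left by str_squish().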

Some suspicious syllables

The one hundred most frequent syllables are shown below, each followed by its frequency.

tfl <- topfeatures(dfmat_Lindeman_1, n = 100)
df.tfl <- data.frame(name = names(tfl), n = tfl, stringsAsFactors = FALSE)
utf8::utf8_print(do.call("paste",c(sep = " -", df.tfl)))
  [1] "အ -15994"   "သည် -11252"  "များ -6937" "ကို -4895"   
  [5] "တွင် -4379"   "သော -3899"  "ဖြစ် -3732"  "ရှိ -3469"   
  [9] "နှင့် -3197"   "က -3069"    "တို့ -3061"    "မ -2990"   
 [13] "နိုင် -2955"   "ရ -2860"    "သ -2848"    "ခဲ့ -2759"   
 [17] "ရာ -2504"   "ကြ -2203"   "စ -2154"    "ငံ -2121"   
 [21] "တ -1961"    "နေ -1884"   "မှ -1874"    "နှစ် -1791"  
 [25] "ပြီး -1763"  "ခု -1731"    "ပ -1708"    "ရေး -1640" 
 [29] "မျိုး -1614"  "ပြည် -1605"  "တစ် -1599"   "သို့ -1589"   
 [33] "မှု -1556"    "သာ -1527"   "မှာ -1515"   "ကြီး -1509" 
 [37] "လာ -1443"   "လူ -1432"    "လည်း -1406"  "ဖြင့် -1401" 
 [41] "နယ် -1362"   "ဝ -1360"    "မာ -1339"   "စု -1325"   
 [45] "ပါ -1305"   "ခြင်း -1271" "သည့် -1232"   "တော် -1228" 
 [49] "အား -1184"  "ဝင် -1130"   "မြန် -1096"  "ရိ -1081"   
 [53] "သူ -1076"    "ဒေ -1053"   "သား -1051"  "ထား -1019" 
 [57] "တောင် -1006" "လ -1000"    "ထို -989"     "ဟု -979"    
 [61] "ဘာ -979"    "ပင် -976"    "ကား -953"   "အာ -952"   
 [65] "ရေ -951"    "ဦး -946"    "လက် -930"    "လေ -913"   
 [69] "ပိုင်း -912"   "ဆုံး -872"    "မြို့ -872"    "လုပ် -838"   
 [73] "ပြု -818"    "ရန် -800"    "စစ် -799"    "ယ -788"    
 [77] "ဆို -784"     "ခံ -776"     "ကာ -738"    "စာ -733"   
 [81] "ပေါ် -727"   "သုံး -715"    "တိုက် -715"    "နောက် -710" 
 [85] "ဥ -703"     "တိုင်း -691"   "တွင်း -689"   "ပြင် -688"  
 [89] "ရား -666"   "ချုပ် -662"   "စား -661"   "မည် -646"   
 [93] "သစ် -639"    "ယူ -637"     "ခြား -623"  "မြောက် -617"
 [97] "ပေါင်း -605" "မေ -604"    "တိ -602"     "သော် -590"  

None of these shows any irregularity, but some of the 100 least frequent syllables, listed below, do.

tfl.1 <- topfeatures(dfmat_Lindeman_1, n = 100, decreasing = FALSE)
df.tfl.1 <- data.frame(name = names(tfl.1), n = tfl.1, stringsAsFactors = FALSE)
utf8::utf8_print(do.call("paste",c(sep = " -", df.tfl.1)))
  [1] "နို်င် -1"    "ဇင့် -1"    "မုံ့ -1"     "ဟဿ -1"    "သွေ -1"   
  [6] "ဒြာ -1"   "ဇန္ဒာ -1"  "ခို့ -1"     "ထ် -1"     "သုန် -1"   
 [11] "လွာ -1"    "ပြွန်း -1"  "လ္ဘက် -1"   "တာ့ခ် -1"   "ပျယ် -1"  
 [16] "ချဲ -1"    "ကပ်း -1"   "ဇမ္ဗူ -1"   "ဗဂ္ဂ -1"   "ငုံ့ -1"    
 [21] "မန္တာ -1"  "ဟိန္ဒု -1"   "တြ -1"    "ကင့် -1"    "မလ္လ -1"  
 [26] "ကြိုင် -1"   "ာင် -1"    "ဝင်္ဘာ -1"  "မက္ကင်း -1" "ယှင်း -1"  
 [31] "ချိမ်း -1"  "အဂ္ဂိ -1"   "တျု -1"    "ငုပ် -1"    "ပပ် -1"   
 [36] "ခီး -1"    "လျှစ် -1"   "ညှို့ -1"     "ဆွံ -1"     "သမ္ပိုင်း -1"
 [41] "ကမ္ပည်း -1" "မဉ္ဇူ -1"   "ပစ္စ -1"   "ပုဏ္ဏ -1"   "ချူ -1"   
 [46] "ပုဗ္ဗ -1"   "ညှစ် -1"    "သင်္ကြံ -1"  "ဖျန်း -1"  "ဗွေ -1"   
 [51] "ငှန်း -1"   "ပုသ် -1"    "မိဿံ -1"    "လိန် -1"    "ပိဿာ -1"  
 [56] "ဝဏ် -1"    "မန္ဈ -1"   "ညိမ်း -1"   "ဝြ -1"    "ပည် -1"   
 [61] "လွှ -1"     "ဣဋ္ဌ -1"   "ညီး -1"    "ရေ့ -1"    "ျှင် -1"   
 [66] "ကြော်း -1" "မျး -1"   "နန္တ -1"   "ဂုဏ္ဏ -1"   "ပုပ္ပား -1"
 [71] "ဥက္ကံ -1"   "သာဒ် -1"   "ဓော -1"   "နိစ္ဆ -1"   "သုန္ဒ -1"  
 [76] "ဒုမ္မ -1"   "ရှိည် -1"    "ခြင့် -1"   "တွ့ -1"     "ညွန်း -1"  
 [81] "မွန့် -1"    "လု့း -1"    "မက္ကင် -1"  "ဟန္နာ -1"  "ဟင်္နာ -1" 
 [86] "ဆံခ် -1"    "ကိုမ် -1"    "သူ့သ် -1"    "ကိုခ် -1"    "ယိုးလ် -1"  
 [91] "သ် -1"     "ဝားခ် -1"  "ယိုးခ် -1"   "သူန် -1"    "သားဖ် -1" 
 [96] "ဇူးပ် -1"   "တုံ့ပ် -1"    "အဆ် -1"    "အေမ် -1"   "ထူးမ် -1"  
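
Every one of the 100 syllables listed above has a frequency of 1, so the list may be only part of a larger set of syllables that occur just once in the corpus. A possible way to pull out all of them is sketched below on a toy dfm; `syl_demo`, `dfmat_demo`, `syl_freq` and `hapax` are illustrative names, and on the real data dfmat_Lindeman_1 would take the place of `dfmat_demo`.

```r
library(quanteda)

# Toy input: two syllable-segmented phrases from the corpus
syl_demo <- c("အီ လက် ထ ရွန်း နစ် ဆို သည် မှာ", "ရည် မှန်း ထား ပါ သည်")
dfmat_demo <- dfm(tokens(syl_demo, what = "fastestword"))

# Frequency of every syllable, summed over all phrases
syl_freq <- colSums(dfmat_demo)

# Names of the syllables that occur exactly once
hapax <- names(syl_freq)[syl_freq == 1]
length(hapax)
```

Inspecting the whole of `hapax`, rather than an arbitrary 100 of them, would make it less likely that a suspicious syllable is overlooked.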

I have marked the suspicious ones in the figure below. Some may be part of foreign names, though.

Bottom one hundred syllables from Lindemann’s corpus

I look for the phrases that contain the suspicious syllables highlighted in the last three lines of the above output.

grep("\\sဆံခ်|\\sကိုမ်|\\sသူ့သ်|\\sကိုခ်|\\sယိုးလ်|\\sသ်|\\sဝားခ်|\\sယိုးခ်|\\sသူန်|\\sသားဖ်|\\sဇူးပ်|\\sတုံ့ပ်|\\sအဆ်|\\sအေမ်|\\sထူးမ်", paste0(" ", Lindemann_doc_phrase1_syl), value = TRUE) %>% sub("^\\s", "", .)
[1] "ဆံခ် ရည် တ ပင် ကိုမ် ယ် ဟ"    "သူ့သ် ဝား ကိုခ် ယိုးလ် ယ် ဟင်"  
[3] "ကိုယ် သ် ဝားခ် ယိုးခ် ရင်း ကို"  "ထို သူန် ဟင့် တက် ဝ"        
[5] "သားဖ် ရစ် လို သော"        "က် ယေး ဇူးပ် ရုက် ရ လော့"  
[7] "ခ် ယစ် တုံ့ပ် ရုလ် ယ် ဟင်"      "အဆ် ဝ အေမ် ယိုး တို့ အား သာ"
[9] "အ ဘယ် သို့ ထူးမ် ရတ် သ နည်း" 

I retrieve the corresponding phrases.

idx_err_syl <- grepl("\\sဆံခ်|\\sကိုမ်|\\sသူ့သ်|\\sကိုခ်|\\sယိုးလ်|\\sသ်|\\sဝားခ်|\\sယိုးခ်|\\sသူန်|\\sသားဖ်|\\sဇူးပ်|\\sတုံ့ပ်|\\sအဆ်|\\sအေမ်|\\sထူးမ်", paste0(" ", Lindemann_doc_phrase1_syl)) %>% which(.)

Lindemann_doc_phrase1[idx_err_syl]
[1] "ဆံခ်ရည်တပင်ကိုမ်ယ်ဟ"    "သူ့သ်ဝားကိုခ်ယိုးလ်ယ်ဟင်"  "ကိုယ်သ်ဝားခ်ယိုးခ်ရင်းကို"
[4] "ထိုသူန်ဟင့်တက်ဝ"       "သားဖ်ရစ်လိုသော"     "က်ယေးဇူးပ်ရုက်ရလော့" 
[7] "ခ်ယစ်တုံ့ပ်ရုလ်ယ်ဟင်"     "အဆ်ဝအေမ်ယိုးတို့အားသာ" "အဘယ်သို့ထူးမ်ရတ်သနည်း" 
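
The long alternation pattern used in the two steps above is typed out by hand. As a sketch of an alternative, the same kind of pattern could be assembled from a character vector of suspicious syllables, so that the list is maintained in one place. The names `sus_demo`, `phr_demo`, `pat_demo` and `idx_demo` are hypothetical, and the toy data are just two of the suspicious syllables and three segmented phrases from the outputs above.

```r
# Two of the suspicious syllables listed above (illustrative subset)
sus_demo <- c("သူန်", "သားဖ်")

# Toy phrases, already segmented into syllables; the last one is unremarkable
phr_demo <- c("ထို သူန် ဟင့် တက် ဝ", "သားဖ် ရစ် လို သော", "ရည် မှန်း ထား ပါ သည်")

# Same pattern shape as above: each syllable preceded by a whitespace character
pat_demo <- paste0("\\s", sus_demo, collapse = "|")

# Prepend a space so that a phrase-initial syllable can match as well
idx_demo <- which(grepl(pat_demo, paste0(" ", phr_demo)))
idx_demo
```

Like the hand-typed pattern, this matches on the start of a syllable only, so a longer syllable beginning with one of the listed strings would also be picked up.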

To find out whether the suspicious phrases exist in the original Burmese Wikipedia article(s), I searched for one of them, “ဆံခ်ရည်တပင်ကိုမ်ယ်ဟ”, on the Burmese Wikipedia site, but no results matched the query. From my experience working with the Burmese Wikipedia dump file, however, I guessed that it might come from the “Sermon on the Mount” (“တောင်းပေါ်ဒေသနာ”) article. I opened that article and searched for the suspicious phrases and, sure enough, they are there.
All of them are in the two paragraphs under the “မကျိန်ဆိုနှင့်။ ရန်သူကိုချစ်ပါ” sub-heading.

Suspicious phrases identified in the ‘Sermon on the Mount’ article


From this I am quite sure that, except for the erroneous syllable “င်း”, which arises from the cleaning process in Lindemann’s corpus, all the suspicious syllables exist as they are in the original Wikipedia articles.