Sunday, July 28, 2019

Cycle 3: Wordcloud of 306K sentence-endings


Summary
This wordcloud doesn’t look as interesting as my earlier ones, which were based on a smaller number of sentences. The reason is that the formal, standard sentence-ending သည် dominates the field so much that the others would practically vanish from the wordcloud if sizes were drawn in proportion to frequency!
The following plot illustrates this fact.

Looking at one “word” sentence-endings

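We have tokenized all 306,290 sentences into “words” using “quanteda”. In my last post we looked at two-word sentence-endings; now we’ll look at the one-word version.
We now extract one-word sentence-endings from x100_itNS5.w_paliN. Each extracted ending consists of the last word plus the closing punctuation mark ။, which is why the code below takes the last two tokens.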
library(quanteda)
library(stringr)
system.time(
  x100_iwpN.2 <- sapply(x100_itNS5.w_paliN, paste0, collapse=" ") %>%
    word(.,-2,-1)
)
   user  system elapsed 
  18.77    0.03   18.91 
str(x100_iwpN.2)
 chr [1:306290] "<U+1015><U+102B> <U+104B>" ...
cat(x100_iwpN.2[c(1, 100000, 200000, 306290)])
ပါ ။ သည် ။ သည် ။ ကွယ်လွန်သည် ။
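To make the extraction step easier to follow, here is a minimal base-R sketch of the same idea on toy data. The names toy_tokens and last_two are hypothetical and not part of the original analysis, and the result is only meant to mirror word(., -2, -1) for sentences that have at least two tokens.

```r
# Toy tokenized sentences (hypothetical stand-ins for x100_itNS5.w_paliN)
toy_tokens <- list(
  c("သူ", "ကွယ်လွန်သည်", "။"),
  c("ကြည့်ပါ", "။")
)

# Keep only the last two tokens of a sentence: the final word and the mark ။
last_two <- function(tok) paste(tail(tok, 2), collapse = " ")

toy_endings <- vapply(toy_tokens, last_two, character(1))
print(toy_endings)
```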
We create a dfm from it so that the frequencies of distinct one-word sentence-endings are available.
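# Note: the ngrams argument of tokens() belongs to the quanteda 1.x API used here;
# with quanteda 2.0 or later, tokens_ngrams(n = 2, concatenator = "-") is likely needed instead.
system.time(
  x100_iwpN2.dfm <- tokens(x100_iwpN.2,ngrams=2,concatenator = "-") %>%
    dfm(.)
)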
   user  system elapsed 
  10.76    0.08   10.67 
x100_iwpN2.dfm
Document-feature matrix of: 306,290 documents, 2,507 features (100.0% sparse).
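As a cross-check that does not depend on the quanteda version, the same kind of frequency table can be sketched in base R. This is only an illustration on toy data: toy_endings, toy_features and toy_freq are hypothetical names, and joining with "-" simply imitates the concatenator used above.

```r
# Toy sentence-endings in the "word ။" form produced by the extraction step
toy_endings <- c("သည် ။", "ပါ ။", "သည် ။", "ဖြစ်သည် ။", "သည် ။")

# Join word and punctuation with "-" to imitate the dfm feature names
toy_features <- gsub(" ", "-", toy_endings, fixed = TRUE)

# Count each distinct ending and sort by descending frequency
toy_freq <- sort(table(toy_features), decreasing = TRUE)
print(toy_freq)
```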
We look at the 200 most frequent sentence-endings:
x100SE2_tf <- textstat_frequency(x100_iwpN2.dfm)[,1:2]  %>%
   .[order(-.$frequency,.$feature),]
utf8::utf8_print(do.call("paste", c(sep= " = ", x100SE2_tf[1:200,])))
  [1] "သည်-။ = 161036"      "ဖြစ်သည်-။ = 32375"    "၏-။ = 24161"       
  [4] "ပါသည်-။ = 12926"     "တယ်-။ = 8567"        "နိုင်သည်-။ = 6276"     
  [7] "ထားသည်-။ = 5075"     "ပေ-။ = 4674"        "ချေ-။ = 3816"      
 [10] "မည်-။ = 3655"        "ပါ-။ = 3085"        "ဆိုသည်-။ = 2095"      
 [13] "ပြန်သည်-။ = 2067"     "စေသည်-။ = 2061"      "နေသည်-။ = 1924"     
 [16] "တတ်သည်-။ = 1757"      "တော့သည်-။ = 1750"     "ခေါ်သည်-။ = 1687"    
 [19] "တည်ရှိသည်-။ = 1551"     "ပေးသည်-။ = 1088"     "ပြီ-။ = 1016"       
 [22] "လိမ့်မည်-။ = 807"       "ဘူး-။ = 715"         "မယ်-။ = 488"        
 [25] "များသည်-။ = 483"     "ပဲ-။ = 477"          "ရ-။ = 456"         
 [28] "ခြင်း-။ = 447"       "ဖူးသည်-။ = 443"       "တည်း-။ = 420"       
 [31] "မွေးဖွားသည်-။ = 330"   "ပြုသည်-။ = 320"       "တူသည်-။ = 311"       
 [34] "ဆိုလိုသည်-။ = 308"       "ကွယ်လွန်သည်-။ = 300"     "ပြသည်-။ = 250"      
 [37] "ခဲ့-။ = 228"          "ပေါက်ရောက်သည်-။ = 190" "များ-။ = 179"      
 [40] "နေထိုင်သည်-။ = 177"     "ရှိ-။ = 172"          "ရန်-။ = 166"        
 [43] "ဖော်ပြသည်-။ = 165"    "ပြုလုပ်သည်-။ = 164"     "ကျင်းပသည်-။ = 159"   
 [46] "သတည်း-။ = 159"       "ပြောသည်-။ = 157"     "ကြည့်ပါ-။ = 145"     
 [49] "မဟုတ်-။ = 145"        "လေ-။ = 144"         "ကြားသည်-။ = 141"    
 [52] "ထမ်းဆောင်သည်-။ = 141"  "စေ-။ = 135"         "နှင့်-။ = 134"        
 [55] "ကြောင်း-။ = 132"     "ပေါ့-။ = 130"        "ရှာသည်-။ = 127"      
 [58] "မိသည်-။ = 126"        "တိုက်ခိုက်သည်-။ = 120"     "တကား-။ = 118"      
 [61] "သေးပါ-။ = 117"      "ထွက်-။ = 116"         "ကြေညာသည်-။ = 114"   
 [64] "စိုက်ပျိုးသည်-။ = 114"    "ကျန်ရစ်သည်-။ = 109"    "တည်ထောင်သည်-။ = 107"  
 [67] "ဒေသဖြစ်သည်-။ = 104"   "နိုင်-။ = 102"         "တွင်သည်-။ = 97"       
 [70] "နည်း-။ = 96"         "ဆင်တူသည်-။ = 91"       "စီရင်သည်-။ = 88"      
 [73] "ပုံ-။ = 86"           "ပေါက်သည်-။ = 86"      "ကို-။ = 84"          
 [76] "စို့-။ = 83"           "ယော-။ = 82"         "လား-။ = 82"        
 [79] "ဘုရား-။ = 80"        "ဆည်းပူးသည်-။ = 76"     "တင်ပြသည်-။ = 76"     
 [82] "အံ့-။ = 76"           "ချီးမြှင့်သည်-။ = 75"    "တံ့-။ = 75"          
 [85] "ဖြစ်ထွန်းသည်-။ = 75"    "သေး-။ = 75"         "ပင်-။ = 74"         
 [88] "ပွင့်သည်-။ = 73"        "စီးဆင်းသည်-။ = 72"     "ဘဲ-။ = 72"          
 [91] "၊-။ = 72"           "ကုန်သည်-။ = 71"        "တာ-။ = 70"         
 [94] "ကျသည်-။ = 68"        "လေး-။ = 68"         "လော့-။ = 67"        
 [97] "ကြ-။ = 65"          "သနည်း-။ = 64"        "ဆောင်သည်-။ = 63"     
[100] "ဖွဲ့စည်းသည်-။ = 62"      "ခဲသည်-။ = 61"         "တဲ့-။ = 59"          
[103] "ဖြစ်ပွားသည်-။ = 58"    "ခြားနားသည်-။ = 57"   "မြို့-။ = 56"         
[106] "အောင်သည်-။ = 56"      "တက်သည်-။ = 54"        "တော့-။ = 54"        
[109] "ခံယူသည်-။ = 53"        "ထွန်းကားသည်-။ = 52"    "လော-။ = 52"        
[112] "စားသည်-။ = 51"       "နန်းတက်သည်-။ = 50"     "လိုက်ပါ-။ = 50"       
[115] "စတင်သည်-။ = 49"       "ထုတ်သည်-။ = 49"        "လဲ-။ = 48"          
[118] "ပြုစုသည်-။ = 47"       "ဖြစ်ပေါ်သည်-။ = 47"    "တိုက်သည်-။ = 46"       
[121] "ပို့သည်-။ = 46"         "မြင်သည်-။ = 46"       "င်း-။ = 45"         
[124] "ထုတ်ပေးသည်-။ = 45"     "ပညာသင်သည်-။ = 44"     "များပြားသည်-။ = 44" 
[127] "ဆုံးဖြတ်သည်-။ = 43"     "ည်-။ = 42"           "သတဲ့-။ = 42"         
[130] "တက်ရောက်သည်-။ = 41"    "တာဝန်ရှိသည်-။ = 41"     "ကျက်စားသည်-။ = 40"   
[133] "တင်သည်-။ = 40"        "ဖြစ်-။ = 40"         "ပစ်သည်-။ = 39"       
[136] "သူ-။ = 39"           "ဉာဏ်-။ = 38"         "နော်-။ = 38"        
[139] "ပစ်ရသည်-။ = 37"       "ပေးအပ်သည်-။ = 37"     "ဝေ-။ = 37"         
[142] "နှစ်သက်သည်-။ = 36"      "ပြု-။ = 36"          "တိ-။ = 35"          
[145] "ကွာခြားသည်-။ = 34"    "စီးသည်-။ = 34"        "စေသတည်း-။ = 34"     
[148] "ညီမျှသည်-။ = 34"       "ထင်သည်-။ = 34"        "ရေး-။ = 34"        
[151] "ဝ်-။ = 34"           "ညျ-။ = 33"          "မေးသည်-။ = 33"      
[154] "ရာ-။ = 33"          "ဟုတ်-။ = 33"          "ကုန်-။ = 32"         
[157] "နီးစပ်သည်-။ = 32"      "ပြီး-။ = 32"         "သော-။ = 32"        
[160] "ဖြစ်ကြောင်း-။ = 31"   "စသည်-။ = 30"         "တင်သွင်းသည်-။ = 30"    
[163] "တွေ့သည်-။ = 30"        "မှု-။ = 30"           "လည်းကောင်း-။ = 30"   
[166] "ခိုင်းသည်-။ = 29"       "စစ်ဆေးသည်-။ = 29"     "စဉ်-။ = 29"         
[169] "တည်ဆောက်သည်-။ = 29"    "မည်သည်-။ = 29"        "မှာ-။ = 29"         
[172] "တင်မြှောက်သည်-။ = 28"   "ပြီးသည်-။ = 28"       "ကြိုးစားသည်-။ = 27"   
[175] "ထုတ်ပြန်သည်-။ = 27"     "ကိုးကွယ်သည်-။ = 26"      "ညီသည်-။ = 26"        
[178] "တွေ့ရှိသည်-။ = 26"       "ထည့်သည်-။ = 26"        "နဲ့-။ = 26"          
[181] "ပို့ချသည်-။ = 26"       "ဦး-။ = 26"          "ပေါ်သည်-။ = 25"      
[184] "ယ့်-။ = 25"           "ကြည့်သည်-။ = 24"       "တင်ပို့သည်-။ = 24"      
[187] "တော်စပ်သည်-။ = 24"     "ဖွင့်လှစ်သည်-။ = 24"      "ဗျာ-။ = 24"        
[190] "ချီတက်သည်-။ = 23"      "ဆန့်ကျင်သည်-။ = 23"     "ထူထောင်သည်-။ = 23"    
[193] "ပျက်စီးသည်-။ = 23"     "ပြောင်းသည်-။ = 23"    "ရရှိ-။ = 23"         
[196] "သလား-။ = 23"        "ချွန်သည်-။ = 22"       "စည်ကားသည်-။ = 22"    
[199] "ဖွင့်သည်-။ = 22"        "ယာ-။ = 22"         
About 52% of the distinct sentence-endings occur only once:
nrow(x100SE2_tf[which(x100SE2_tf$frequency==1), ])*100/nrow(x100SE2_tf)
[1] 51.97447
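The same share can be computed more compactly as the mean of a logical vector. The sketch below uses a small made-up frequency vector (toy_freq is hypothetical). The second line checks that 1,303 singletons out of 2,507 features would give the printed figure; the count 1,303 is inferred from that percentage, not taken from the output.

```r
# Toy frequencies of distinct endings (hypothetical values)
toy_freq <- c(5, 1, 1, 3)

# Share of endings that occur exactly once, in percent
pct_once <- 100 * mean(toy_freq == 1)
print(pct_once)

# Consistency check against the figure printed above (1303 is an inferred count)
pct_reported <- round(100 * 1303 / 2507, 5)
print(pct_reported)
```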
We now look at the bottom 100.
utf8::utf8_print(do.call("paste", c(sep= " = ", x100SE2_tf[2408:2507,])))
  [1] "အပြား-။ = 1"        "အပြီး-။ = 1"         "အပြေး-။ = 1"       
  [4] "အပွား-။ = 1"         "အဖိုး-။ = 1"          "အဖောက်-။ = 1"       
  [7] "အဖြစ်အပျက်-။ = 1"     "အဖွဲ့-။ = 1"           "အမည်ပေးသည်-။ = 1"    
 [10] "အမယ်-။ = 1"          "အမိန့်-။ = 1"          "အမိန့်တော်-။ = 1"      
 [13] "အမုန်း-။ = 1"         "အမူ-။ = 1"           "အများ-။ = 1"       
 [16] "အမြှောက်-။ = 1"       "အယူ-။ = 1"           "အရက်-။ = 1"         
 [19] "အရင်းအနှီး-။ = 1"      "အရပ်-။ = 1"          "အရသာ-။ = 1"        
 [22] "အရိမေတ္တယျဘုရား-။ = 1" "အရိုး-။ = 1"          "အရောင်-။ = 1"       
 [25] "အရေးယူ-။ = 1"        "အလင်းပြ-။ = 1"       "အလာ-။ = 1"         
 [28] "အလီ-။ = 1"           "အလုပ်ခွင်-။ = 1"        "အလုပ်များ-။ = 1"     
 [31] "အလံ-။ = 1"           "အလျောက်-။ = 1"       "အလွန်-။ = 1"         
 [34] "အဝန်း-။ = 1"         "အဝှန်း-။ = 1"         "အသိ-။ = 1"          
 [37] "အသိုင်းအဝိုင်း-။ = 1"     "အသံ-။ = 1"           "အသံကြိုး-။ = 1"       
 [40] "အာခေါင်-။ = 1"       "အာပေါဓာတ်-။ = 1"     "အာရုံ-။ = 1"         
 [43] "အာရုံစိုက်-။ = 1"        "အာသာ-။ = 1"         "အားထုတ်စေသည်-။ = 1"   
 [46] "အားလျော်စွာ-။ = 1"    "အားျ-။ = 1"         "အိပ်ချင်မူးတူး-။ = 1"   
 [49] "အိမ်ခြေရာခြေ-။ = 1"   "အိမ်မိုးခြင်တင်-။ = 1"    "အိုင်-။ = 1"          
 [52] "အိုင်ချင်း-။ = 1"       "အိုင်း-။ = 1"          "အိုးစောင်ခြမ်း-။ = 1"  
 [55] "အီ'ဖြစ်သည်-။ = 1"      "အုတ်ခုံ-။ = 1"          "အုန်းခွံရောင်-။ = 1"    
 [58] "အုန်းဆံခြည်ထွေး-။ = 1"   "အုန်းသီးဆန်ခြောက်-။ = 1" "အုပ်-။ = 1"          
 [61] "အုပ်စု-။ = 1"          "အုပ်ထိန်းသူ-။ = 1"       "အူ-။ = 1"           
 [64] "အူသိမ်-။ = 1"          "အူး-။ = 1"           "အောက်တိုဘာ-။ = 1"     
 [67] "အောင်သူ-။ = 1"        "အေ့-။ = 1"           "အဲပေါ့'ဟုခေါ်သည်-။ = 1" 
 [70] "အံ့မခန်း-။ = 1"        "အျော-။ = 1"         "အရောင်ဖျော့သည်-။ = 1" 
 [73] "ဥစ္စာ-။ = 1"         "ဥပါယ်တံမျဉ်-။ = 1"     "ဥယျာဉ်မှူး-။ = 1"     
 [76] "ဦးမင်း-။ = 1"        "ဧရာ-။ = 1"          "ဧရိယာ-။ = 1"        
 [79] "ဩစတြေးလျ-။ = 1"     "ဩဘာ-။ = 1"          "ိ-။ = 1"            
 [82] "ံ-။ = 1"             "း-။ = 1"            "်-။ = 1"            
 [85] "၁၂ရက်-။ = 1"         "၁၉၂၈ခု-။ = 1"        "၂ဖြစ်သည်-။ = 1"      
 [88] "၂သည်-။ = 1"          "၂၀-ဖြစ်သည်-။ = 1"     "၂၀၀၂-ခု-။ = 1"      
 [91] "၂၄ဥဒါဟရုဏ်-။ = 1"     "၃၈-ပါး-။ = 1"       "၄-မျိုး-။ = 1"       
 [94] "၅၂၀,၅၉၁ဖြစ်သည်-။ = 1" "၆-ယောက်-။ = 1"       "၆-သွယ်-။ = 1"        
 [97] "၊်-။ = 1"            "၏်-။ = 1"            "artဖြစ်သည်-။ = 1"    
[100] "loadဖြစ်သည်-။ = 1"   

Creating the “one-word” wordcloud

To draw a wordcloud with Myanmar-language characters rendered in the Unicode “Myanmar3” font, we need to import the font using the extrafont package, then register and load it. This was done earlier with the following code:
library(extrafont)
font_import(pattern="Myanmar3.ttf")
loadfonts(device="win")
Draw the wordcloud:
library(extrafont)
set.seed(2307)
textplot_wordcloud(x100_iwpN2.dfm, font = "Myanmar3", min_size = .9, max_size = 13, min_count = 30, color = RColorBrewer::brewer.pal(8, "Dark2"))

Workaround for extracting one-syllable sentence-endings

The above is based on tokenization into “words” using the “quanteda” package. Previously, I had explored sentence-endings using only the last syllable. You will recall that I resorted to quanteda “words” because it would take impossibly long to syllabify the text using my own application. Now I have an idea for dodging this problem with regex. For example, I could replace ..သည် with သည်. From the 200 highest-frequency sentence-endings shown above, I think I should handle ..သည်, ..မည်, ..ပါ, ..နည်း, ..တည်း, ..တဲ့, and ..ကြောင်း this way. On the other hand, the two-syllable တကား needs to be retained, as a single syllable of it would be meaningless. There were also a number of suspect endings, but since I am displaying the sentence-endings in a wordcloud, low frequencies wouldn’t be visible anyway and can safely be ignored.
# create patterns of text to search and replace
t1 <- c(".+သည်", ".+မည်", ".+ပါ", ".+နည်း", ".+တည်း", ".+တဲ့", ".+ကြောင်း")
t2 <- c("သည်", "မည်", "ပါ", "နည်း", "တည်း", "တဲ့", "ကြောင်း")

# search and replace
names(t2) <- t1        
x100_iwpN.2_1 <- str_replace_all(x100_iwpN.2, t2)
system.time(
  x100_iwpN21.dfm <- tokens(x100_iwpN.2_1,ngrams=2,concatenator = "-") %>%
    # tokens_replace(., t1, t2, valuetype = "regex") %>%
    dfm(.)
)
   user  system elapsed 
  12.33    0.09   12.34 
x100_iwpN21.dfm
Document-feature matrix of: 306,290 documents, 1,627 features (99.9% sparse).
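One side effect of the unanchored patterns is worth checking: .+ပါ also matches inside a word ending in ပါး, so an ending such as ၃၈-ပါး would be reduced to ပါး. That seems a plausible explanation for ပါး-။ showing up in the frequency list that follows. The base-R sketch below compares the original patterns with a variant that requires the space before ။. The anchored variant and the helper apply_rules are assumptions for illustration, not what was run for this post.

```r
# Toy endings in "word ။" form
x <- c("ကြည့်ပါ ။", "၃၈-ပါး ။", "ဖြစ်သည် ။")

# Patterns as used above, and a hypothetical variant anchored on the following space
unanchored <- c(".+သည်" = "သည်", ".+ပါ" = "ပါ")
anchored   <- c(".+သည် " = "သည် ", ".+ပါ " = "ပါ ")

# Apply each pattern/replacement pair in turn (hypothetical helper)
apply_rules <- function(x, rules) {
  for (p in names(rules)) x <- sub(p, rules[[p]], x)
  x
}

res_unanchored <- apply_rules(x, unanchored)
res_anchored   <- apply_rules(x, anchored)
print(res_unanchored)  # the second element is altered
print(res_anchored)    # the second element is left unchanged
```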
# view frequencies of top 100 sentence-endings
x100SE2_tf1 <- textstat_frequency(x100_iwpN21.dfm)[,1:2]  %>%
   .[order(-.$frequency,.$feature),]
utf8::utf8_print(do.call("paste", c(sep= " = ", x100SE2_tf1[1:100,])))
  [1] "သည်-။ = 244706"   "၏-။ = 24161"     "တယ်-။ = 8567"    
  [4] "ပေ-။ = 4674"     "မည်-။ = 4474"     "ချေ-။ = 3816"   
  [7] "ပါ-။ = 3402"     "ပြီ-။ = 1016"     "ဘူး-။ = 715"     
 [10] "တည်း-။ = 616"     "မယ်-။ = 488"      "ပဲ-။ = 477"      
 [13] "ရ-။ = 456"       "ခြင်း-။ = 447"    "ခဲ့-။ = 228"      
 [16] "များ-။ = 179"    "ရှိ-။ = 172"       "ကြောင်း-။ = 168" 
 [19] "ရန်-။ = 166"      "နည်း-။ = 162"     "စေ-။ = 145"     
 [22] "မဟုတ်-။ = 145"     "လေ-။ = 144"      "နှင့်-။ = 134"     
 [25] "ပေါ့-။ = 130"     "တကား-။ = 118"    "ထွက်-။ = 116"     
 [28] "နိုင်-။ = 102"      "တဲ့-။ = 101"       "ပုံ-။ = 86"       
 [31] "ကို-။ = 84"        "စို့-။ = 83"        "ယော-။ = 82"     
 [34] "လား-။ = 82"      "ဘုရား-။ = 80"     "အံ့-။ = 76"       
 [37] "တံ့-။ = 75"        "သေး-။ = 75"      "ပင်-။ = 74"      
 [40] "ဘဲ-။ = 72"        "၊-။ = 72"        "တာ-။ = 70"      
 [43] "လေး-။ = 68"      "လော့-။ = 67"      "ကြ-။ = 65"      
 [46] "မြို့-။ = 56"       "တော့-။ = 54"      "လော-။ = 52"     
 [49] "လဲ-။ = 48"        "င်း-။ = 45"       "ပါး-။ = 43"     
 [52] "ည်-။ = 42"        "ဖြစ်-။ = 40"      "သူ-။ = 39"       
 [55] "ဉာဏ်-။ = 38"      "နော်-။ = 38"      "ဝေ-။ = 37"      
 [58] "ပြု-။ = 36"       "တိ-။ = 35"        "ရေး-။ = 34"     
 [61] "ဝ်-။ = 34"        "ညျ-။ = 33"       "ရာ-။ = 33"      
 [64] "ဟုတ်-။ = 33"       "ကုန်-။ = 32"       "ပြီး-။ = 32"     
 [67] "သော-။ = 32"      "မှု-။ = 30"        "လည်းကောင်း-။ = 30"
 [70] "စဉ်-။ = 29"       "မှာ-။ = 29"       "နဲ့-။ = 26"       
 [73] "ဦး-။ = 26"       "ယ့်-။ = 25"        "ဗျာ-။ = 24"     
 [76] "ရရှိ-။ = 23"       "သလား-။ = 23"     "ယာ-။ = 22"      
 [79] "တရား-။ = 21"     "နှစ်-။ = 20"       "မျိုး-။ = 20"     
 [82] "ပြန်-။ = 19"      "သလဲ-။ = 19"       "ကောင်း-။ = 18"   
 [85] "ခု-။ = 18"        "တော်-။ = 18"      "ဗျ-။ = 18"      
 [88] "လို-။ = 18"        "သောင်း-။ = 18"    "ရဲ့-။ = 17"       
 [91] "ကြီး-။ = 16"      "ချက်-။ = 16"      "ဆောင်-။ = 16"    
 [94] "လို့-။ = 16"        "ဝင်-။ = 16"       "ကိုး-။ = 15"      
 [97] "သာ-။ = 15"       "တည့်-။ = 14"       "တု-။ = 14"       
[100] "ကြား-။ = 13"    
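To put a number on how strongly သည် dominates, the counts reported in the two frequency lists can be turned into shares. This is a rough sketch: it uses the 306,290 documents as the denominator, which assumes that every sentence contributes exactly one ending.

```r
# Counts taken from the two frequency lists in this post
n_sentences <- 306290
sany_word   <- 161036   # သည်-။ before collapsing ..သည် into သည်
sany_syll   <- 244706   # သည်-။ after collapsing

# Approximate share of all sentences, in percent
share <- round(100 * c(word = sany_word, syllable = sany_syll) / n_sentences, 1)
print(share)
```

On these figures, သည် closes roughly half of all sentences as a “word” and roughly four fifths once the longer forms are collapsed into it.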

Creating the one-syllable wordcloud

set.seed(2307)
textplot_wordcloud(x100_iwpN21.dfm, font = "Myanmar3", min_size = 1, max_size = 10, min_count = 8, color = RColorBrewer::brewer.pal(8, "Dark2"))
You could look closely at the plot above and complain that it still contains “one-word” sentence-endings. Yes, I could eliminate them by writing a very long list of substitutions (too hard) or by raising min_count (easy). But the latter option would make the plot look less good!

Code for the barplot at the top of this page

Running the plot and saving it to a graphics file:
library(extrafont)
png("freq10.png")
barplot(x100SE2_tf1$frequency[1:10], names.arg = x100SE2_tf1$feature[1:10],
        main = "Frequencies of top 10 sentence-endings", cex.main = .9,
        cex.names = 0.75, family = "Myanmar3", col = "pink", cex.axis = 0.7,
        ylim = c(0, 250000), xlab = "Feature", ylab = "Frequency", cex.lab = 0.8)
dev.off()