Tuesday, April 9, 2019

Wordcloud from 1.2m syllables


Now that I’ve got the syllables segmented from a sample of 10,000 lines of Myanmar Wikipedia text, the next logical step is to define words out of them, I guess. After that I would like to look into the detection of sentiments and emotions in Myanmar language text. I can guess that the journey from working on syllables to emotion detection will be long and torturous, and I have no idea at all what it will involve. Anyway, I anticipate an enjoyable learn-as-I-go journey.
As of now, I am interested in looking at which syllables are most frequently found in my sample corpus. Such tasks must have been formidable in the pre-computer, pre-NLP days, especially in relation to large corpora. Actually, I chose to do it because it is a relatively easy task, given the NLP software available nowadays.
In one of my earlier posts, I demonstrated the use of the quanteda R package in creating syllables and then a wordcloud. I am using the same approach here. First, I need to create a quanteda corpus from text vectors. Now the syllables I’ve created (see my last post) are a “list” of 10,000 character vectors.
This is converted to a quanteda corpus,
syll.c <- corpus(t(data.frame(lapply(syll, paste0, collapse = " "))))
and tokenized into words. Since the syllables we formed earlier are delimited by spaces (as “words” are in English), we can treat our syllables as words for the purpose of using standard NLP software.
syll.tfw <- tokens(syll.c, what = "fasterword")
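The collapse-then-tokenize idea can be tried in isolation with base R alone. This is only a minimal sketch: syll_toy is a made-up stand-in for the real syll list, and strsplit() stands in for the tokenizer purely for illustration. It is not a claim about what tokens() does internally.

```r
# Toy stand-in for `syll`: one character vector of syllables per line.
# (These three "lines" are made up for illustration, not taken from the corpus.)
syll_toy <- list(
  c("မြန်", "မာ", "စာ"),
  c("အ", "များ"),
  c("ရှိ", "သည်", "။")
)

# Collapse each vector into a single space-delimited string, one per line.
lines_toy <- vapply(syll_toy, paste, character(1), collapse = " ")

# Splitting on the space again recovers the syllables of each line.
resplit <- strsplit(lines_toy, " ", fixed = TRUE)

# TRUE when the round trip loses nothing.
identical(resplit, syll_toy)
```

If the round trip is lossless, treating space-delimited syllables as “words” should not change what gets counted.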
To create a wordcloud, quanteda requires our tokenized syllables to be converted to a “dfm” (document-feature matrix).
dfmsyll.tfw <- dfm(syll.tfw)
str(dfmsyll.tfw)
Formal class 'dfm' [package "quanteda"] with 15 slots
  ..@ settings    : list()
  ..@ weightTf    :List of 3
  .. ..$ scheme: chr "count"
  .. ..$ base  : NULL
  .. ..$ K     : NULL
  ..@ weightDf    :List of 5
  .. ..$ scheme   : chr "unary"
  .. ..$ base     : NULL
  .. ..$ c        : NULL
  .. ..$ smoothing: NULL
  .. ..$ threshold: NULL
  ..@ smooth      : num 0
  ..@ ngrams      : int 1
  ..@ skip        : int 0
  ..@ concatenator: chr "_"
  ..@ version     : int [1:3] 1 3 14
  ..@ docvars     :'data.frame':    10000 obs. of  0 variables
  ..@ i           : int [1:694230] 0 118 147 279 462 519 651 683 688 927 ...
  ..@ p           : int [1:7692] 0 93 301 9399 10582 11011 12798 13619 13868 16725 ...
  ..@ Dim         : int [1:2] 10000 7691
  ..@ Dimnames    :List of 2
  .. ..$ docs    : chr [1:10000] "text1" "text2" "text3" "text4" ...
  .. ..$ features: chr [1:7691] "<U+1016><U+102E>" "<U+1002><U+103B><U+102E>" "<U+101E><U+100A><U+103A>" "<U+101B><U+1015><U+103A>" ...
  ..@ x           : num [1:694230] 5 1 1 2 1 1 1 1 1 1 ...
  ..@ factors     : list()
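For readers unfamiliar with the term, a dfm is a table of counts with one row per document and one column per feature (here, per syllable). In the str() output above, Dim shows 10,000 documents by 7,691 distinct syllables, while the i and x slots each have 694,230 entries, one per stored cell. That is fewer than 1% of the 76,910,000 possible cells, which is why a sparse format suits this data. The base R sketch below builds a tiny dense version of the same idea. The names docs_toy and count_one are hypothetical and the data are made up.

```r
# Two made-up "documents", each a vector of syllables.
docs_toy <- list(
  text1 = c("အ", "မ", "အ"),
  text2 = c("မ", "ရ")
)

# The feature set is every distinct syllable seen in any document.
features <- unique(unlist(docs_toy))

# Count how often each feature occurs in one document (zeros included).
count_one <- function(d) as.integer(table(factor(d, levels = features)))

# One row per document, one column per feature.
dfm_toy <- t(vapply(docs_toy, count_one, integer(length(features))))
colnames(dfm_toy) <- features
dfm_toy

# Column totals give corpus-wide frequencies, highest first.
sort(colSums(dfm_toy), decreasing = TRUE)
```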
library(extrafont)
package ‘extrafont’ was built under R version 3.5.2
Registering fonts with R
set.seed(405)
textplot_wordcloud(dfmsyll.tfw, font = "Myanmar3", min_size = 0.8, max_size = 15, min_count = 1000, color = RColorBrewer::brewer.pal(8, "Dark2"))
If you change min_count to min_count = 5000, you get:
I am happy to see that the wordcloud can give interesting information about our written language. Even as a native speaker, and a user of written Myanmar in and out of the office for a long time, I had no idea that the syllable “အ” would occur most frequently in text. Out of curiosity I looked at the wordcloud in my earlier post, Word Cloud with Myanmar Syllables. I found that the same syllable occupies the central place, though that wordcloud was drawn from just five sentences! Then I looked in the Abridged Myanmar Dictionary by the Myanmar Language Commission and Judson’s Burmese-English Dictionary and found that 151 of 197 pages and 125 of 146 pages, respectively, for the letter “အ” were words beginning with the syllable “အ”. So that, I thought, was the reason. But that could have been common knowledge. I don’t know.
To see the syllables with the highest frequencies, you can use the topfeatures() function of quanteda.
topfeatures(dfmsyll.tfw)
             အ       သည\u103a              ။      မ\u103bား              ၊              ကို 
         60916          40545          32959          22132          20969          18734 
တ\u103dင\u103a ဖ\u103cစ\u103a              မ              ခဲ့ 
         16098          14999          13153          12706 
But not all of the text is displayed as Myanmar characters. The topfeatures() function returns a named numeric vector. One solution I found involves (i) converting the named vector to a data frame (solution by Mark Needham), (ii) concatenating the data in each row of the data frame, and (iii) printing with the utf8_print() function. Here are the top 100 syllables with the highest frequencies:
tf <- topfeatures(dfmsyll.tfw, n=100)
df.tf <- data.frame(name = names(tf), n = tf, stringsAsFactors = F)
utf8::utf8_print(do.call("paste",c(sep = " -  ", df.tf)))
  [1] "အ -  60916"     "သည် -  40545"    "။ -  32959"     "များ -  22132"  "၊ -  20969"    
  [6] "ကို -  18734"     "တွင် -  16098"    "ဖြစ် -  14999"   "မ -  13153"     "ခဲ့ -  12706"    
 [11] "ရ -  11863"     "သော -  11760"   "က -  11558"     "ရှိ -  11516"     "နှင့် -  10863"   
 [16] "၏ -  10143"     "သ -  9993"      "ရာ -  9963"     "နိုင် -  8906"     "ပါ -  8795"    
 [21] "ရေး -  7942"    "တို့ -  7507"      "ခု -  7360"      "ပြီး -  7278"    "စ -  7143"     
 [26] "နှစ် -  7058"     "နေ -  6849"     "ပ -  6837"      "တ -  6819"      "မှ -  6637"     
 [31] "တစ် -  6430"     "သူ -  6125"      "မှု -  6125"      "သို့ -  6055"      "တော် -  5992"   
 [36] "ခြင်း -  5804"   "ကြီး -  5673"    "မှာ -  5579"     "လာ -  5411"     "သည့် -  5311"    
 [41] "ဖြင့် -  5121"    "ငံ -  5088"      "ဦး -  5051"     "လည်း -  5038"    "၍ -  4888"     
 [46] "ကြ -  4798"     "ထား -  4647"    "အား -  4638"    "လုပ် -  4605"     "မြို့ -  4477"    
 [51] "သာ -  4428"     "သား -  4374"    "ရန် -  4292"     "စာ -  4258"     "ပြု -  4188"    
 [56] "ကာ -  4094"     "ဝင် -  4053"     "တာ -  3942"     "လ -  3933"      "ဆို -  3896"     
 [61] "လက် -  3787"     "မျိုး -  3776"    "မာ -  3721"     "ခံ -  3592"      "လူ -  3454"     
 [66] "ပြည် -  3365"    "ဆောင် -  3256"   "စု -  3173"      "စား -  3155"    "ဟု -  3142"     
 [71] "ဝ -  3124"      "ထို -  3121"      "မည် -  3118"     "ပေး -  3040"    "သုံး -  3025"    
 [76] "ဆုံး -  2900"     "လေ -  2878"     "ပင် -  2763"     "တင် -  2729"     "နောက် -  2714"  
 [81] "ဘာ -  2714"     "မြန် -  2699"    "စေ -  2695"     "နယ် -  2615"     "မင်း -  2595"   
 [86] "ယ -  2587"      "၌ -  2565"      "စစ် -  2544"     "ရေ -  2531"     "ဖွဲ့ -  2514"     
 [91] "မိ -  2498"      "ယူ -  2487"      "တည် -  2475"     "န -  2456"      "ရား -  2432"   
 [96] "နာ -  2376"     "တွင်း -  2345"    "ရောက် -  2343"   "ညာ -  2342"     "ကြောင်း -  2339"
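The name-and-count pasting step can be checked on a small named vector with base R. This is a sketch under assumptions: tf_toy is a hand-built stand-in for the topfeatures() result, using the first three counts listed above, and labels_direct is a hypothetical shortcut that pastes names and values without the intermediate data frame.

```r
# Hand-built stand-in for the named numeric vector from topfeatures().
tf_toy <- setNames(c(60916, 40545, 32959), c("အ", "သည်", "။"))

# Route used in the post: named vector -> data frame -> paste each row.
df_toy <- data.frame(name = names(tf_toy), n = tf_toy, stringsAsFactors = FALSE)
labels_df <- do.call("paste", c(sep = " -  ", df_toy))

# Hypothetical shortcut: paste the names and values directly.
labels_direct <- paste(names(tf_toy), tf_toy, sep = " -  ")

# TRUE when both routes build the same strings.
identical(labels_df, labels_direct)
```

Either way the result is an ordinary character vector. Whether it displays as Myanmar script still depends on the console, so printing through utf8::utf8_print() as above may remain necessary.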
Since the section mark “။” represents the end of a sentence, we can see that the sample text of 10,000 lines consisted of 32,959 sentences, containing a total of 1,218,112 syllables (including two punctuation marks, “။” and “၊”).
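A few derived figures follow from these totals by simple arithmetic. The sketch below assumes that every sentence ends with exactly one section mark, which is a simplification, and takes the count for “၊” from the frequency list above. The variable names are hypothetical.

```r
# Totals reported in this post.
n_tokens  <- 1218112  # all syllables, punctuation marks included
n_section <- 32959    # section marks, taken here as the number of sentences
n_comma   <- 20969    # the second punctuation mark in the frequency list
n_lines   <- 10000    # lines in the sample

# Syllables left after excluding the two punctuation marks.
n_syll_only <- n_tokens - n_section - n_comma

# Rough averages, valid only under the one-mark-per-sentence assumption.
tokens_per_sentence <- n_tokens / n_section
sentences_per_line  <- n_section / n_lines

round(c(syllables_only      = n_syll_only,
        tokens_per_sentence = tokens_per_sentence,
        sentences_per_line  = sentences_per_line), 2)
```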
