Monday, October 7, 2019

Syllable co-occurrence


I have in hand a respectably large number of syllables in the Myanmar language. What could I do with them? The easiest thing was to construct wordclouds with them, and I have already done that. Maybe it would be interesting to find out how different syllables are associated within a sentence, something like the association of variables, with which I am a little familiar. So I skimmed through the help pages of the Quanteda package, hoping to find a function that would do something related. I found the “fcm()” function, which creates a “feature co-occurrence matrix”:
Create a sparse feature co-occurrence matrix, measuring co-occurrences of features within a user-defined context. The context can be defined as a document or a window within a collection of documents, with an optional vector of weights applied to the co-occurrence counts.
Well, I guess fcm() would give something not quite like the association of numeric variables, but rather the syllables that are found together in sentences.
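To make "syllables that are found together in sentences" concrete, here is a minimal base R sketch. It does not use quanteda, and it counts co-occurrence in a simplified way (the number of sentences in which two syllables both appear), which may differ from how fcm() counts by default. The toy sentences and the names `sentences`, `inc` and `cooc` are made up for illustration.

```r
# Toy "sentences", already split into syllables (romanised placeholders).
sentences <- list(
  c("ka", "la", "ma"),
  c("ka", "la"),
  c("la", "ma", "la")
)
vocab <- sort(unique(unlist(sentences)))

# Incidence matrix: one row per sentence, one column per syllable,
# 1 if the syllable occurs in the sentence at least once, else 0.
inc <- t(sapply(sentences, function(s) as.integer(vocab %in% s)))
colnames(inc) <- vocab

# crossprod(inc) is t(inc) %*% inc: cell [i, j] counts the sentences
# that contain both syllable i and syllable j.
cooc <- crossprod(inc)
diag(cooc) <- 0              # ignore a syllable paired with itself
cooc[lower.tri(cooc)] <- 0   # keep each pair once (upper triangle only)
print(cooc)
```

In this toy data "ka" and "la" share two sentences, "ka" and "ma" share one, and "la" and "ma" share two.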
What I would like to produce is something like the network plot of an “fcm” object in the quanteda tutorial, where edges show co-occurrences of features.
Recalling that I have 300K-plus sentences tokenized into “words” as reported in my earlier post “Cycle 3: Naive word segmentation (that works)”, I tried running “fcm()” on that data, and failed.
x100.dfm <- dfm(x100_itNS5.w_paliN)
fcmat_news <- fcm(x100.dfm)
Error in .local(x, y, ...) : 
  Cholmod error 'out of memory' at file ../Core/cholmod_memory.c, line 147
Later I found out that this meant I didn’t have enough memory to run fcm(). No wonder! The Global Environment pane of the RStudio session showed that “x100.dfm” is a “Large dfm (25578584190 elements, 129.1 Mb)”. Still, it is wonderful that Quanteda could run dfm() on “x100_itNS5.w_paliN” on my little machine.
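A rough back-of-the-envelope check makes the failure less surprising. The sketch below assumes that x100.dfm has 306,290 documents (the sentence count of the Myanmar Wikipedia corpus; that this is also the row count of x100.dfm is an assumption) and that a dense cell would take 8 bytes. It is only arithmetic, not a measurement of what fcm() actually tries to allocate.

```r
elements <- 25578584190   # cells reported by RStudio for x100.dfm
n_docs   <- 306290        # ASSUMED number of documents (sentences)

n_feat    <- elements / n_docs     # implied number of features
dense_gib <- elements * 8 / 2^30   # size if every cell were an 8-byte double
fcm_cells <- n_feat^2              # cells in a features-by-features matrix

cat("implied features:      ", format(n_feat, big.mark = ","), "\n")
cat("dense dfm size in GiB: ", round(dense_gib, 1), "\n")
cat("cells in a full fcm:   ", format(fcm_cells, big.mark = ","), "\n")
```

Under that assumption the element count divides exactly into 83,511 features. The dfm fits in 129.1 Mb only because it is stored sparsely, and a feature-by-feature matrix over that many features has billions of possible cells, so even a sparse result can outgrow a small machine if many pairs co-occur.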
Now I remember that I have the first 20K sentences from my Myanmar Wikipedia corpus tokenized into syllables with my own syllabification code. To get a meaningful result out of the fcm exercise, I guess I will have to remove all English characters including punctuation, and remove Myanmar punctuation and numbers as well as Myanmar stopwords, before running fcm.
So I looked for Myanmar stopwords on the Web. First I found that the authors of Statistical Analyses of Myanmar Corpora had assembled a 1.6 million-sentence Myanmar corpus, out of which they had identified about 1216 stopwords. However, I couldn’t find any indication that they were sharing any of their work.
On the other hand swanhtet1992/myanmar-data on GitHub has shared 275 Myanmar stopwords and I like his one-line README:
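အရေးကြီးတာ၊ လိုအပ်တာ သိကြတယ်မှလား။ နောက်လူတွေအတွက် ရှိတာလေးတွေထုတ်ပေးကြတာပေါ့။ (Roughly: “We all know this is important and needed, don’t we? So let’s put out what little we have for those who come after us.”)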
Kudos to you swanhtet -yay!
Anyway, to continue with my quest for fcm, I thought I would forget about stopwords (in general) for the moment and just concentrate on leaving out the sentence-ending syllable (not counting the section mark, “။”), because it co-occurs with every other syllable in many sentences and obviously qualifies as a stopword.
To begin with I have this “x20k_syllQS” which is the syllabified first 20,000 sentences from my Myanmar Wikipedia corpus of 306290 sentences.
(1) Remove the last two syllables from each of the 20,000 sentences, that is, the sentence-ending syllable and the section mark “။”.
(2) Create a corpus from the results of (1).
(3) Create tokens.
(4) Remove all tokens containing English characters, Myanmar numbers, or punctuation.
(5) Create dfm.
(6) Create fcm.
Now for step (1), we remove the last two syllables from each of the 20,000 sentences.
# head(x, -2) drops the last two elements and is safe for very short sentences
f <- function(x) head(x, -2)
x20ksyll.N2 <- lapply(x20k_syllQS, f)
x20ksyll.N2[c(1,20000)]
[[1]]
 [1] "ဂူ"               "ဂဲ"               "၏"               "သု"               "ည"              
 [6] "စီ"               "မံ"               "ကိန\u103aး"       "("               "P"              
[11] "r"               "o"               "j"               "e"               "c"              
[16] "t"               "Z"               "e"               "r"               "o"              
[21] ")"               "လေ့"              "လာ"              "ရ\u103eာ"        "ဖ\u103dေ"       
[26] "သူ"               "ဖ\u103cစ\u103a"  "သည့\u103a"        "ဂ\u103bန\u103aး" "ဟ\u103dန\u103aး"
[31] "က"               "က\u103dတ\u103a"  "ကီး"              "မ\u103bား"       "သည\u103a"       
[36] "က\u103cား"       "ခံ"               "မ\u103bား"       "ဖ\u103cစ\u103a"  "သည့\u103a"       
[41] "ဝိုင\u103a"        "ဖိုင\u103a"        "ထောက\u103a"      "ပံ့"               "သ"              
[46] "မ\u103bား"       "က"               "ဖတ\u103a"        "ရ\u103eု"         "နိုင\u103a"       

[[2]]
 [1] "ထို"              "အ"              "ခ\u102b"        "မိ"              "ခင\u103a"      
 [6] "ဖ\u103cစ\u103a" "သူ"              "သည\u103a"       "\""             "မိ"             
[11] "မိ"              "တို့"              "တ\u103dင\u103a" "အ"              "မ\u103dေ"      
[16] "ဆက\u103a"       "ခံ"              "မည့\u103a"       "အ"              "မ\u103dေ"      
[21] "ခံ"              "သား"            "မ"              "ရ\u103eိ"       
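One detail about dropping the last two elements is worth a small illustration. Indexing with 1:(length(x) - 2) and calling head(x, -2) agree on ordinary sentences but differ when a sentence has only two syllables. The vectors and function names below are made-up examples, not data from the corpus.

```r
# Two ways to drop the last two elements of a vector.
drop_by_index <- function(x) x[1:(length(x) - 2)]
drop_by_head  <- function(x) head(x, -2)

long  <- c("a", "b", "c", "d")
short <- c("a", "b")

print(drop_by_index(long))    # both approaches keep "a" "b"
print(drop_by_head(long))

# For a two-element vector, 1:0 expands to c(1, 0), so indexing keeps "a"
# instead of returning nothing; head() returns an empty vector.
print(drop_by_index(short))
print(drop_by_head(short))
```

With 20,000 Wikipedia sentences the difference probably touches few, if any, documents, but head() avoids the edge case altogether.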
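For steps (2) to (5):
system.time(
  # Join each sentence's syllables with spaces, build a corpus, split on whitespace
  syll20k.dfm <- corpus(t(data.frame(lapply(x20k_syllQS, paste0, collapse = " ")))) %>%
    tokens(., what = "fasterword") %>%
    # Drop tokens with Myanmar digits, "၊" or "။", punctuation, or ASCII letters/digits
    tokens_select(., pattern = "[\u1040-\u1049\u104a-\u104b]|[[:punct:]]|[A-z0-9]",
                  selection = "remove", valuetype = "regex") %>%
    dfm(.)
)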
   user  system elapsed 
  58.91    0.06   59.61 
dim(syll20k.dfm)
[1] 20000  5153
That resulted in a document-feature matrix of: 20,000 documents (sentences), 5,153 features (syllables) that is 99.4% sparse.
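To see what the removal pattern catches, here is a small sketch that applies the same regular expression to a handful of tokens with base R's grepl(). This is a stand-in for illustration only: quanteda uses a different regex engine internally, so edge cases (especially non-ASCII punctuation) may behave differently. The token vector is made up.

```r
# Same pattern as in the tokens_select() call.
pattern <- "[\u1040-\u1049\u104a-\u104b]|[[:punct:]]|[A-z0-9]"

toks <- c("\u1000\u102c",  # a Myanmar syllable (consonant + vowel sign)
          "\u1041",        # Myanmar digit one
          "\u104b",        # Myanmar section mark
          "(",             # ASCII punctuation
          "P",             # ASCII letter
          "7")             # ASCII digit

# TRUE means the token contains a match and would be removed.
drop <- grepl(pattern, toks, perl = TRUE)
print(drop)

kept <- toks[!drop]
print(length(kept))
```

Note that the range `[A-z]` also spans the six ASCII characters that sit between "Z" and "a" (such as "[" and "_"); since those are punctuation, they would be removed by the `[[:punct:]]` branch anyway.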
For step (6), we create the fcm, which results in a feature co-occurrence matrix of 5,153 by 5,153 features (syllables). We then keep only the top 50 features for plotting.
system.time(
  syll20k.fcm <- fcm(syll20k.dfm)
)
   user  system elapsed 
   3.43    0.42    3.92 
dim(syll20k.fcm)
[1] 5153 5153
feat <- names(topfeatures(syll20k.fcm, 50))
syll20k.fcm_select <- fcm_select(syll20k.fcm, pattern = feat)
dim(syll20k.fcm_select)
[1] 50 50
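The idea of keeping only the top features can be sketched in base R on a tiny matrix. The sketch assumes that "top" means the largest column totals; how topfeatures() ranks the features of an fcm is my assumption here, not something confirmed above. The matrix `m` and its labels are invented.

```r
syl <- c("ka", "la", "ma", "na")

# A small upper-triangular co-occurrence matrix with made-up counts.
m <- matrix(c(0, 6, 1, 1,
              0, 0, 3, 0,
              0, 0, 0, 1,
              0, 0, 0, 0),
            nrow = 4, byrow = TRUE, dimnames = list(syl, syl))

# ASSUMPTION: rank features by column totals, then keep the top two.
totals <- sort(colSums(m), decreasing = TRUE)
top    <- names(totals)[1:2]

# Subset rows and columns together so the result stays square.
m_top <- m[top, top]
print(m_top)
```

Selecting the same names on both rows and columns is what turns the 5,153 by 5,153 matrix into a 50 by 50 one.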
I tried to bluff the “textplot_network()” function by adding family = “Pyidaungsu”. Didn’t work.
size <- log(colSums(dfm_select(syll20k.dfm, feat)))
set.seed(93019)
textplot_network(syll20k.fcm_select, min_freq = 0.8, vertex_size = size / max(size) * 3, family = "Pyidaungsu")
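The vertex_size expression is easier to read with numbers in hand. The sketch below uses made-up frequencies to show what taking logs and rescaling to a maximum of 3 does: the most frequent syllable gets size 3, and the others shrink in proportion to the log of their counts rather than the counts themselves.

```r
# Hypothetical syllable frequencies, not taken from the corpus.
freq <- c(ka = 10000, la = 1000, ma = 100)

size        <- log(freq)             # compress the range of the counts
vertex_size <- size / max(size) * 3  # rescale so the largest vertex is 3

print(round(vertex_size, 2))
```

A hundredfold difference in frequency becomes only a twofold difference in vertex size, which keeps rare syllables visible in the plot.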
Looking hard at the syntax of the textplot_network() function, I see that I need to tell it which font to use for labelling the vertices of the plot. I bluffed again:
size <- log(colSums(dfm_select(syll20k.dfm, feat)))
set.seed(93019)
textplot_network(syll20k.fcm_select, min_freq = 0.8, vertex_size = size / max(size) * 3, vertex_labelfont = "Pyidaungsu")
Error in check_font(vertex_labelfont) : 
  Pyidaungsu is not found on your system. Run extrafont::font_import() and extrafont::loadfonts(device = "win") to use custom fonts.
It won’t be fooled! It tells me to use “extrafont” again! As you’ll recall I wasn’t able to import fonts with “extrafont” on my 32-bit, Windows-7 Lenovo laptop.
Desperate, I tried using the showtext package and the Cairo graphic package as suggested by Yixuan Qiu in “showtext: Using System Fonts in R Graphics”. I tried many variations trying to copy him, but none worked out the way it should.
But finally:

The plot above shows the co-occurrence of the 50 most frequent syllables from 20,000 sentences. The syllable co-occurrence matrix can be viewed like this:
library(kableExtra)
y <- syll20k.fcm_select
kable(y) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                font_size = 11) %>%
  scroll_box(width = "600px")
'as.data.frame.dfm' is deprecated.
Use 'convert(x, to = "data.frame")' instead.
See help("Deprecated")
[Table: 50 × 50 syllable co-occurrence matrix for the selected features, upper triangle only (zeros below the diagonal). The individual counts ran together in this copy and cannot be separated reliably. Row and column labels, in order, with “?” where a label did not survive: သည်, ?, ?, တွင်, ခဲ့, ?, သို့, ရှိ, ရာ, လည်း, ကို, သော, ပါ, တစ်, အား, ခု, တို့, မျိုး, နှစ်, ?, မှ, မှာ, မှု, ရန်, လက်, ပြီး, ရေး, ကြီး, စာ, လူ, သာ, သား, လုပ်, တော်, ပြု, ?, မင်း, ရား, မြို့, ဦး, ပင်, ထို, စစ်, ဝင်, ငံ, ဆောင်, စား, မည်, ခြင်း, တွင်း]
For now, I am leaving the interpretation of this table and the plot above to you, and I will also let you find out for yourselves the right way to write code that gets Myanmar text into the plot.
