The wordcloud of syllables in the previous post shows that the syllables with the highest frequencies included “သည် - 40545”, “။ - 32959” and “၏ - 10143”. From experience, I would guess that most occurrences of the first and third syllables are followed by the second one, the character “။”, to mark the end of a sentence. These two syllables are found in most formal writing, while other, less formal sentence endings should occur considerably less often. To test this idea, I could (i) create bigrams from the syllables tokenized as words, (ii) keep only the bigrams that consist of any one syllable followed by “။”, and (iii) create a dfm from the bigrams obtained in step (ii).
From the previous exercise, I have the tokenized syllables in 10,000 elements: syll.tfw.
library(quanteda)
# create bigrams
xx <- tokens_ngrams(syll.tfw, n = 2)
# convert to dfm
dfmxx <- dfm(xx)
# retain only the sentence endings
senEnd <- dfm_select(dfmxx, "\u104b$", selection = "keep", valuetype = "regex")
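To check what the "\u104b$" pattern actually keeps, the same bigram-then-filter logic can be sketched in base R on a toy vector. This is only an illustration under stated assumptions: the vector syl is made up, and paste() and grepl() stand in for tokens_ngrams() and dfm_select(); it is not how quanteda works internally.

```r
# Toy syllable sequence (hypothetical, not taken from the corpus)
syl <- c("သူ", "လာ", "သည်", "။", "စာ", "အုပ်", "၏", "။")

# Adjacent pairs joined with "_", mimicking the default bigram separator
bigrams <- paste(head(syl, -1), tail(syl, -1), sep = "_")

# Keep only the bigrams whose last character is U+104B (။)
endings <- bigrams[grepl("\u104b$", bigrams)]
print(endings)
```

The "$" anchor matters here: a bigram such as "။_စာ" contains “။” but starts with it, so it is not counted as a sentence ending.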
By default quanteda shows the bigrams by separating the two components with an underscore.
stfD <- topfeatures(senEnd, n=50)
df.stfD <- data.frame(name = names(stfD), n = stfD, stringsAsFactors = FALSE)
utf8::utf8_print(do.call("paste",c(sep = " - ", df.stfD)))
[1] "သည်_။ - 25908" "၏_။ - 1781" "တယ်_။ - 1143" "ပါ_။ - 561" "ပေ_။ - 504"
[6] "မည်_။ - 472" "ချေ_။ - 236" "ဘူး_။ - 129" "။_။ - 121" "တည်း_။ - 89"
[11] "ပြီ_။ - 77" "ပဲ_။ - 68" "မယ်_။ - 67" "ရ_။ - 60" "ခြင်း_။ - 48"
[16] "ခဲ့_။ - 46" "ရှိ_။ - 41" "ပေါ့_။ - 40" "နှင့်_။ - 36" "၂_။ - 32"
[21] "ထွက်_။ - 30" "ဘဲ_။ - 26" "၄_။ - 23" "၁_။ - 22" "များ_။ - 22"
[26] "ရန်_။ - 22" "စေ_။ - 21" "ရား_။ - 20" "ကို_။ - 20" "တဲ့_။ - 18"
[31] "၃_။ - 17" "ဟုတ်_။ - 17" "နည်း_။ - 17" "တော့_။ - 16" "လဲ_။ - 16"
[36] "လား_။ - 15" "၅_။ - 15" "၆_။ - 14" "ချက်_။ - 14" "တာ_။ - 14"
[41] "ကုန်_။ - 13" "သော_။ - 13" "လေ_။ - 13" "ပင်_။ - 12" "၈_။ - 12"
[46] "မင်း_။ - 11" "ဝ_။ - 11" "ကြ_။ - 10" "ရေး_။ - 10" "စို့_။ - 10"
For the wordcloud, I prefer to use sentence-ending bigrams without the underscore character in between. This can be done by specifying concatenator = "" in tokens_ngrams().
xx.0 <- tokens_ngrams(syll.tfw, n = 2, concatenator = "")
# convert to dfm
dfmxx.0 <- dfm(xx.0)
# retain only the sentence endings
senEnd.0 <- dfm_select(dfmxx.0, "\u104b$", selection = "keep", valuetype = "regex")
We plot a wordcloud of sentence endings. Taking a cue from the top 50 features of sentence endings shown above, I chose a minimum frequency of 10. I set the range from smallest to largest character size to 3:10 for convenient viewing; the real frequency range of 10:25908 is just too wide to reflect on the plot. Since we are dealing with text from an encyclopedia, I expected a formal writing style in which sentences predominantly end with the syllable “သည်”, followed by much less frequent cases of “၏” and a few other formal and informal (or conversational) sentence endings.
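A rough bit of arithmetic shows why the raw range is hard to show on a plot. The sketch below maps the five largest counts from the output above into the 3:10 size range with a plain linear rescale. The function rescale_linear() is hypothetical and is only an assumption for illustration; it is not necessarily how textplot_wordcloud() scales sizes.

```r
# Counts of the five most frequent sentence endings, copied from the output above
counts <- c(25908, 1781, 1143, 561, 504)
min_count <- 10

# Hypothetical linear rescale of a count into the lo:hi size range
rescale_linear <- function(n, lo = 3, hi = 10) {
  lo + (hi - lo) * (n - min_count) / (max(counts) - min_count)
}

sizes <- round(rescale_linear(counts), 2)
print(sizes)
```

Under this assumption only the top ending gets the full size of 10, while every other ending stays below 3.5, which is close to the minimum size.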
library(extrafont)
set.seed(413)
# note: from quanteda 3.0 onwards, textplot_wordcloud() is in the quanteda.textplots package
textplot_wordcloud(senEnd.0, font = "Myanmar3", min_size = 3, max_size = 10, min_count = 10, color = RColorBrewer::brewer.pal(6, "Dark2"))
The total number of different sentence-ending syllables in our sample of 10,000 texts is the number of features in the senEnd dfm, which the nfeat() function returns. When that came out to be 570, I was completely surprised. It cannot be that many, so something must be wrong.
nfeat(senEnd)
[1] 570
So I looked at the lowest end of the frequencies. To me, almost all of the bigrams there look like something other than true sentence-ending syllables.
stfA <- topfeatures(senEnd, n=100, decreasing = FALSE)
df.stfA <- data.frame(name = names(stfA), n = stfA, stringsAsFactors = FALSE)
utf8::utf8_print(do.call("paste",c(sep = " - ", df.stfA)))
[1] "၆၃၅_။ - 1" "၂၂၄_။ - 1" "နှင်း_။ - 1" "ဏှာ_။ - 1"
[5] "အင်း_။ - 1" "ရှိန်_။ - 1" "ယားစ်_။ - 1" "လက္ခန်_။ - 1"
[9] "၉၅_။ - 1" "ယုတ်_။ - 1" "၂၇၁_။ - 1" "၂၀၀၈_။ - 1"
[13] "၁၅_။ - 1" "ဟိုရ်_။ - 1" "၁၃၁၄_။ - 1" "ခုန်_။ - 1"
[17] "၁၇၄_။ - 1" "သာျ_။ - 1" "ယောက်_။ - 1" "ယို_။ - 1"
[21] "ဘို့_။ - 1" "မြှောက်_။ - 1" "၅၀_။ - 1" "၈၄_။ - 1"
[25] "ပေး_။ - 1" "၂၀၁၂_။ - 1" "သို့_။ - 1" "တ_။ - 1"
[29] "သျ_။ - 1" "ထာ_။ - 1" "ကိစ္စ_။ - 1" "မုန်_။ - 1"
[33] "၃၀၆_။ - 1" "မောက္ခ_။ - 1" "လက္ခံ_။ - 1" "၁၉၈၄_။ - 1"
[37] "ဒ_။ - 1" "၂၁၇_။ - 1" "လွဲ_။ - 1" "ရုံး_။ - 1"
[41] "လတ္တံ_။ - 1" "မောင်_။ - 1" "နစ်_။ - 1" "မွတ္က်ာက္ျရွည္ခံ_။ - 1"
[45] "၁၀၄_။ - 1" "၁၈၇_။ - 1" "သက်_။ - 1" "ဖျတ်_။ - 1"
[49] "ကြမ်း_။ - 1" "သည်း_။ - 1" "ခိုက်_။ - 1" "နပ်_။ - 1"
[53] "ဝယ်_။ - 1" "ပြောင်း_။ - 1" "ကျု_။ - 1" "အယ်လ်_။ - 1"
[57] "ရော_။ - 1" "၅၈_။ - 1" "၂၇_။ - 1" "သာ်_။ - 1"
[61] "သည််_။ - 1" "ပြင်_။ - 1" "ရဲ_။ - 1" "ထုတ်_။ - 1"
[65] "သဲ_။ - 1" "၁၈၃၄၉_။ - 1" "သိ_။ - 1" "ကောင်_။ - 1"
[69] "ဆယ်_။ - 1" "၉၃_။ - 1" "ထက်_။ - 1" "၁၄၁_။ - 1"
[73] "စ_။ - 1" "နိုး_။ - 1" "၄၄၈_။ - 1" "ပျိုး_။ - 1"
[77] "ကျွန်_။ - 1" "၅၄_။ - 1" "မှီ_။ - 1" "စည်_။ - 1"
[81] "၂၀၃_။ - 1" "သမ္ပန္နော_။ - 1" "သောဝ်_။ - 1" "လက်_။ - 1"
[85] "လွှင့်_။ - 1" "၇၂_။ - 1" "မြူ_။ - 1" "ဇာ_။ - 1"
[89] "ကော_။ - 1" "၂၀_။ - 1" "၃၄၀_။ - 1" "၃၁၃_။ - 1"
[93] "ဥက္ကဋ္ဌ_။ - 1" "ပေါ်_။ - 1" "ခင္း_။ - 1" "ချန်_။ - 1"
[97] "၇၄_။ - 1" "၂၂၁_။ - 1" "၁၇၁၄၉_။ - 1" "ဂါ_။ - 1"
In particular, I couldn’t make sense of ngram [44]. Maybe the method I used to extract sentences from the raw Wikipedia corpus had flaws. I may need to look at a sample of these strange cases sentence by sentence to get some idea of what went wrong.
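One way to start sorting the strange cases is to flag bigrams whose first component is made up only of Myanmar digits, since a number followed by “။” is probably not a sentence-ending syllable. The sketch below runs in base R on a handful of feature names copied from the output above. The vector feats and the choice of pattern are assumptions for illustration, not a tested cleaning step.

```r
# A few feature names copied from the low-frequency output above
feats <- c("၆၃၅_။", "၂၂၄_။", "နှင်း_။", "ဏှာ_။", "၂၀၀၈_။", "ပေး_။")

# Myanmar digits occupy U+1040 to U+1049;
# flag bigrams whose first part consists of digits only
is_number <- grepl("^[\u1040-\u1049]+_\u104b$", feats)

print(data.frame(feature = feats, is_number = is_number))
print(sum(is_number))
```

If the pattern behaves as expected, the same regex could presumably be passed to dfm_select() with selection = "remove" and valuetype = "regex" to drop such features from senEnd, though that step is untested here.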