Saturday, November 19, 2022

Most used Myanmar alphabets in text

 
Ye Naingthinn, a young friend of mine, recently recounted his experience of typing Myanmar language text for 14 years and suggested that the most frequently used alphabet was “န”. He got this insight from inspecting how worn the different keys on his keyboards were.

Out of curiosity I set out to find out the frequencies of Myanmar alphabets from the three Myanmar language corpora I have in hand:

  • My own corpus of Wikipedia articles extracted from the Myanmar Language Wikipedia site. This consisted of 306,405 sentences. I had posted about the compilation of this corpus in a series of posts in my “Bayanathi Technology” blog beginning with “An Ambitious Big Myanmar Corpus” of February 28, 2019.
  • A Corpus of Modern Burmese; 20,106 sentences. This is a corpus of modern Burmese compiled by John Okell in the 1990s and converted into Unicode more recently. Available here.
  • Asian Language Treebank Parallel Corpus; 20,000 sentences. Available here.

The workflow would be quite simple:

  1. Read the corpus and extract only the Myanmar alphabets in the text. The extraction could be done by using the “stringr” package in the R computing environment.
  2. Count the occurrences of each alphabet. This could be done with the “quanteda” package.
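
The two steps can also be sketched without any packages, using only base R. This is a minimal illustration on a made-up two-word sample; sample_text, letters_found and letter_counts are names invented here, not objects from my actual analysis.

```r
# Toy input (hypothetical sample, not taken from the corpora).
# The first string spells မြန်မာစာ, the second spells ကောင်း.
sample_text <- c("\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C",
                 "\u1000\u1031\u102C\u1004\u103A\u1038")

# Step 1: keep only the code points U+1000..U+1021 (the letters က to အ);
# vowel signs, medials and tone marks fall outside this range and are dropped.
matches <- gregexpr("[\u1000-\u1021]", sample_text, perl = TRUE)
letters_found <- unlist(regmatches(sample_text, matches))

# Step 2: count each letter and sort by descending frequency.
letter_counts <- sort(table(letters_found), decreasing = TRUE)
print(letter_counts)
```

In this sample မ occurs twice and န, စ, က and င once each, so မ comes first in the sorted table.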

For my own Wikipedia corpus

For the purpose of this exercise I used the list of alphabets marked with the red border in the Myanmar Unicode chart shown below (the range U+1000 to U+1021 used in the code):
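
As a quick check, the code points in the range U+1000 to U+1021 can be listed in base R. This is only a sketch, and the name myanmar_letters is made up for illustration.

```r
# Build the letters from their code points, one string per code point.
myanmar_letters <- intToUtf8(0x1000:0x1021, multiple = TRUE)

length(myanmar_letters)              # number of code points in the range
print(paste(myanmar_letters, collapse = " "))
```

The range holds 34 code points, so 34 is the largest number of distinct alphabets any of the frequency lists can contain.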

The two steps of the workflow outlined above were completed for the 306,405 sentences in 8.96 seconds.

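library(quanteda)
library(stringr)  # also provides the %>% pipe
# x100_itN_Sen.5: the corpus sentences (character vector)
system.time(
  {
    # Step 1: keep only the letters U+1000 to U+1021 in each sentence
    wikiCorpus_dfm <- str_extract_all(x100_itN_Sen.5, "[\u1000-\u1021]") %>%
      tokens() %>%   # one token per extracted letter
      dfm()          # Step 2: document-feature matrix of letter counts
  }
)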
   user  system elapsed 
   8.40    0.31    8.96 

The following output shows each alphabet found in the corpus together with its total frequency, listed in descending order of frequency.

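# Most frequent features (letters) with their counts, in descending order
topf_wiki <- topfeatures(wikiCorpus_dfm, n = 50)
df_topf_wiki <- data.frame(name = names(topf_wiki), n = topf_wiki, stringsAsFactors = FALSE)
# Print as "letter -  count", showing the Myanmar characters as UTF-8
utf8::utf8_print(do.call("paste", c(sep = " -  ", df_topf_wiki)))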
 [1] "က -  1616928" "င -  1602978" "မ -  1334124" "သ -  1257355"
 [5] "တ -  1225978" "န -  1137802" "ပ -  1058835" "ရ -  955376" 
 [9] "စ -  894067"  "အ -  852211"  "ည -  783737"  "လ -  734658" 
[13] "ခ -  670507"  "ဖ -  359153"  "ထ -  318580"  "ယ -  309591" 
[17] "ဆ -  296001"  "ဝ -  194679"  "ဘ -  150506"  "ဒ -  131955" 
[21] "ဟ -  115057"  "ဂ -  107010"  "ဗ -  67088"   "ဇ -  53375"  
[25] "ဉ -  41166"   "ဏ -  35751"   "ဓ -  34915"   "ဌ -  17577"  
[29] "ဋ -  9724"    "ဈ -  3509"    "ဍ -  3275"    "ဠ -  3213"   
[33] "ဃ -  2363"    "ဎ -  103"    
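
To set my friend's suggestion against these numbers, the counts can be turned into ranks and shares of the total. Below is a small sketch in base R with the counts copied from the output above; wiki_counts, wiki_rank and wiki_share are illustrative names only.

```r
# Counts copied from the Wikipedia corpus output above.
wiki_counts <- c(
  "က" = 1616928, "င" = 1602978, "မ" = 1334124, "သ" = 1257355,
  "တ" = 1225978, "န" = 1137802, "ပ" = 1058835, "ရ" = 955376,
  "စ" = 894067,  "အ" = 852211,  "ည" = 783737,  "လ" = 734658,
  "ခ" = 670507,  "ဖ" = 359153,  "ထ" = 318580,  "ယ" = 309591,
  "ဆ" = 296001,  "ဝ" = 194679,  "ဘ" = 150506,  "ဒ" = 131955,
  "ဟ" = 115057,  "ဂ" = 107010,  "ဗ" = 67088,   "ဇ" = 53375,
  "ဉ" = 41166,   "ဏ" = 35751,   "ဓ" = 34915,   "ဌ" = 17577,
  "ဋ" = 9724,    "ဈ" = 3509,    "ဍ" = 3275,    "ဠ" = 3213,
  "ဃ" = 2363,    "ဎ" = 103
)

wiki_rank  <- rank(-wiki_counts)              # 1 = most frequent
wiki_share <- wiki_counts / sum(wiki_counts)  # fraction of all counted letters

n_letter <- "\u1014"                          # the letter န
print(wiki_rank[[n_letter]])                  # rank of န
print(round(100 * wiki_share[[n_letter]], 2)) # share of န in percent
```

In this corpus န ranks sixth, behind က, င, မ, သ and တ.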

For John Okell’s corpus and the Asian Language Treebank Parallel Corpus

Result for “A Corpus of Modern Burmese” by John Okell:

 [1] "က -  148612" "တ -  120899" "မ -  100653" "င -  93395" 
 [5] "ပ -  86787"  "န -  78937"  "သ -  78889"  "လ -  77981" 
 [9] "ရ -  73253"  "စ -  56417"  "အ -  52951"  "ည -  44181" 
[13] "ခ -  40478"  "ယ -  34222"  "ထ -  22462"  "ဆ -  21630" 
[17] "ဖ -  20911"  "ဘ -  18730"  "ဟ -  11764"  "ဒ -  10427" 
[21] "ဝ -  9883"   "ဗ -  4755"   "ဂ -  3638"   "ဇ -  2345"  
[25] "ဉ -  2219"   "ဏ -  2196"   "ဓ -  1807"   "ဈ -  493"   
[29] "ဌ -  440"    "ဃ -  262"    "ဠ -  259"    "ဋ -  167"   
[33] "ဍ -  116"    "ဎ -  5"     

Result for the “Asian Language Treebank Parallel Corpus”:

 [1] "က -  151662" "င -  123947" "မ -  113792" "တ -  111718"
 [5] "သ -  107562" "န -  105134" "ပ -  98080"  "အ -  87225" 
 [9] "ရ -  84712"  "စ -  84637"  "ခ -  76598"  "လ -  62294" 
[13] "ည -  60714"  "ဆ -  33621"  "ယ -  32843"  "ဖ -  29508" 
[17] "ထ -  23509"  "ဒ -  16204"  "ဘ -  13650"  "ဝ -  13389" 
[21] "ဟ -  11408"  "ဂ -  10565"  "ဗ -  4703"   "ဉ -  4222"  
[25] "ဇ -  3017"   "ဏ -  2993"   "ဓ -  1408"   "ဌ -  1197"  
[29] "ဈ -  315"    "ဍ -  278"    "ဋ -  235"    "ဠ -  19"    
[33] "ဃ -  3"     
