Bayanathi Technology: The second cycle: QD Corpus 2

This may yet be another ambitious cycle that begins with a false start. But, as I reported in my last post, I’ve got some solid material to work on. So far I’ve removed the articles and related text created by robots. Next, I’ll (i) remove all non-Myanmar characters, (ii) split the text into sentences and remove non-sentence text fragments, and (iii) remove “sentences” below a certain number of characters so that we have a better chance of keeping only genuine sentences. Then, (iv) segment the sentences into syllables. After that we can look at the sentence-ending syllables again, and at other interesting analyses.

library(xml2)
system.time(
  xdoc_textNodeSet <- xml_find_all(xdoc, "//text()")
)

   user  system elapsed 
  19.90    0.23   20.50

# convert to character vector
xdocTNS.t <- xml_text(xdoc_textNodeSet)

str(xdocTNS.t)

 chr [1:890935] "<U+101D><U+102E><U+1000><U+102E><U+1015><U+102E><U+1038><U+1012><U+102E><U+1038><U+101A><U+102C><U+1038>" ...

# remove all characters outside U+1000-U+104F (the part of the Myanmar Unicode block kept here)
system.time(
  my_xdocTNS.t <- gsub("[^\u1000-\u104f]", "", xdocTNS.t)
)

   user  system elapsed 
  34.36    0.04   34.65

# remove lines that are now empty
myNbl_xdocTNS.t <- my_xdocTNS.t[my_xdocTNS.t != ""]
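
To see what this filter keeps and drops, here is a minimal sketch on a made-up vector (toy_lines and toy_clean are hypothetical names, not part of the corpus). It applies the same character class as above and then drops the elements that end up empty.

```r
# Toy input: mixed Latin/Myanmar text, a Latin-only line, and a Myanmar-only line
toy_lines <- c("Wiki \u1019\u103c\u1014\u103a\u1019\u102c 2019",
               "no Myanmar text here",
               "\u1000\u102d\u102f\u104b")

# Keep only code points in U+1000..U+104F, as in the step above
toy_clean <- gsub("[^\u1000-\u104f]", "", toy_lines)

# Lines with no Myanmar characters are now empty strings; drop them
toy_clean <- toy_clean[toy_clean != ""]

print(length(toy_clean))  # number of lines that survive
print(nchar(toy_clean))   # code points left in each surviving line
```

Note that spaces and Western digits are removed as well, so the surviving text runs together without word or sentence spacing.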

The following code block gives us the number of lines in our text (= 184,415), and also a sample of 30 lines. Note that the second line of text is not shown completely.

length(myNbl_xdocTNS.t)

[1] 184415

utf8::utf8_print(myNbl_xdocTNS.t[1:30])

 [1] "ဝီကီပီးဒီးယား"                                                                            
 [2] "စိုင္္စိုင္္စိုင္္"                                                                               
 [3] "ဗဟိုစာမျက်နှာ"                                                                            
 [4] "ဵဗဟိုစာမဵကနာဗဟိုစာမဵကနာဗဟိုစာမျက်နှာယူနီကုဒ်"                                                        
 [5] "မှတ်ချက်။ဤနေရာသည်အက်ဒမင်များစီမံခန့်ခွဲသူများအားမေးခွန်းများမေးမြန်းရန်နေရာမဟုတ်သလိုအက်ဒမင်အဖြစ်လျောက်ထားရန်နေ…"
 [6] "ဝက်ဘ်ဝီကီပီးဒီးယားမြန်မာယူနီကုဒ်"                                                                
 [7] "ဝီကီပီးဒီးယားထိန်းသိမ်းရေး"                                                                   
 [8] "ဝီကီပီးဒီးယားထိန်းသိမ်းရေး"                                                                   
 [9] "ဗဟိုစာမျက်နှာ"                                                                            
[10] "ဝီကီပီးဒီးယားမှကြိုဆိုပါသည်။နိဒါန်းမည်သူမဆိုကြည့်ရှုပြင်ဆင်နိုင်သောအခမဲ့လွတ်လပ်စွယ်စုံကျမ်းဖြစ်ပါသည်။အကြောင်းအရာပေါင်းခုကိုမြန်…"
[11] "ယူနီကုဒ်"                                                                                 
[12] "ယူနီကုဒ်"                                                                                 
[13] "ယူနီကုဒ်"                                                                                 
[14] "ယူနီကုဒ်"                                                                                 
[15] "ပြင်ဆင်ရန်အလိုအလျောက်အတည်ပြုထားသောအသုံးပြုသူများကိုသာခ"                                             
[16] "ဗဟုိစာမ္ယက္န္ဟာဗဟိုစာမဵကနာဗဟုိစာမ္ယက္န္ဟာဝိကိပိဒိယအခမဲ့လ္ဝတ္လပ္စ္ဝယ္စုံက္ယမ္းဝီကီပီးဒီးယားမြန်မာယူနီကုဒ်"                  
[17] "ယူနီကုဒ်"                                                                                 
[18] "ဝီကီပီးဒီးယားထိန်းသိမ်းရေး"                                                                   
[19] "ဝီကီပီးဒီးယားထိန်းသိမ်းရေး"                                                                   
[20] "ဝီကီပီးဒီးယားမြန်မာယူနီကုဒ်"                                                                   
[21] "ယူနီကုဒ်"                                                                                 
[22] "ယူနီကုဒ်"                                                                                 
[23] "ဘော့အလိုအလျောက်ရှင်းလင်းခဲ့သည်"                                                                 
[24] "ဤစာသားကိုမဖျက်ရကျေးဇူးပြု၍ဤစာသားကိုမဖျက်ပါနှင့်ဝီကီပီးဒီးယားမှကြိုဆိုပါတယ်ကျေးဇူးပြု၍ဤအပိုင်းကိုသည်အတိုင်းထားပေးပါ။ဤ…"
[25] "ဗမာစာ"                                                                                
[26] "ဘော့မြန်မာဘာသာစကားသို့ပြန်ညွှန်းနှစ်ထပ်ဖြစ်နေသည်ကိုပြင်နေသည်"                                           
[27] "မြန်မာဘာသာစကား"                                                                        
[28] "မြန်မာဝီကီပိဒီးယားကိုယူနီကုတ်စာသားအသုံးပြု၍ရေးသားသည်။ယူနီကုတ်မြန်မာစာကိုမြန်မာဘာသာဖြင့်ရေးသားသောအင်တာနက်စာမျက်နှာ…"
[29] "ဝီကီပီးဒီးယားမြန်မာယူနီကုဒ်"                                                                   
[30] "ဝီကီပီးဒီးယားမြန်မာယူနီကုဒ်"

We count the sentence markers (“\u104b”) in each of lines 1-30 above, and then read line 24, which contains three sentences.

nchar(gsub("[^\u104b]","", myNbl_xdocTNS.t[1:30]))

 [1]  0  0  0  0 14  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  4  0
[30]  0
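
The count above works by deleting everything except “\u104b” and measuring what is left. An equivalent way, sketched here on made-up strings (toy, n_by_gsub and n_by_match are hypothetical names), is to locate the markers directly with gregexpr(); both should give the same counts.

```r
# Toy lines with 0, 1 and 3 sentence markers (U+104B)
toy <- c("\u1000\u102d\u102f",
         "\u1000\u102d\u102f\u104b",
         "\u1000\u104b\u1001\u104b\u1002\u104b")

# Approach used above: delete everything that is not U+104B, count what remains
n_by_gsub <- nchar(gsub("[^\u104b]", "", toy))

# Alternative: find every U+104B and count the matches per line
n_by_match <- lengths(regmatches(toy, gregexpr("\u104b", toy, fixed = TRUE)))

print(n_by_gsub)
print(n_by_match)
```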

cat(myNbl_xdocTNS.t[24])

ဤစာသားကိုမဖျက်ရကျေးဇူးပြု၍ဤစာသားကိုမဖျက်ပါနှင့်ဝီကီပီးဒီးယားမှကြိုဆိုပါတယ်ကျေးဇူးပြု၍ဤအပိုင်းကိုသည်အတိုင်းထားပေးပါ။ဤစာမျက်နှာကိုပုံမှန်ရှင်းလင်းပါသည်။သင်၏တည်းဖြတ်မှုစွမ်းရည်ကိုအောက်တွင်လွတ်လပ်စွာစမ်းသပ်နိုင်ပါသည်။

We extract the sentences, that is, the text segments delimited by the “\u104b” character.

# split into sentences (char_segment() comes from the quanteda package)
library(quanteda)
system.time(
  sen_xdocTNS.t <- char_segment(myNbl_xdocTNS.t, pattern = "\u104b", valuetype = "regex", pattern_position = "after")
)

   user  system elapsed 
  11.50    0.55   12.16

The split drops the “\u104b” delimiter, so to complete the sentences we add it back. Now we have 508,556 separate “sentences”, one per line.

# add \u104b back to the end of each sentence
senP_xdocTNS.t <- paste0(sen_xdocTNS.t, "\u104b")
length(senP_xdocTNS.t)

[1] 508556
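
For readers without quanteda, the split-and-restore step can be approximated in base R. This is only a sketch on a made-up string, not the code used for the corpus, and toy_line, parts and toy_sentences are hypothetical names. strsplit() also drops the delimiter, so it is pasted back in the same way.

```r
# One toy line holding three "sentences", each ending in U+104B
toy_line <- "\u1000\u102d\u102f\u104b\u1001\u102b\u104b\u1002\u102e\u104b"

# Split on the marker; this removes the marker itself
parts <- unlist(strsplit(toy_line, "\u104b", fixed = TRUE))

# Restore the marker at the end of every piece
toy_sentences <- paste0(parts, "\u104b")

print(length(toy_sentences))
print(all(endsWith(toy_sentences, "\u104b")))
```

With this base R version, a trailing fragment that has no marker of its own would also get “\u104b” appended, so it is worth checking how lines that do not end in a sentence marker are handled.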

To check, we view the three sentences that are now in lines 22-24.

utf8::utf8_print(senP_xdocTNS.t[c(22:24)])

[1] "ဤစာသားကိုမဖျက်ရကျေးဇူးပြု၍ဤစာသားကိုမဖျက်ပါနှင့်ဝီကီပီးဒီးယားမှကြိုဆိုပါတယ်ကျေးဇူးပြု၍ဤအပိုင်းကိုသည်အတိုင်းထားပေးပါ။"
[2] "ဤစာမျက်နှာကိုပုံမှန်ရှင်းလင်းပါသည်။"                                                            
[3] "သင်၏တည်းဖြတ်မှုစွမ်းရည်ကိုအောက်တွင်လွတ်လပ်စွာစမ်းသပ်နိုင်ပါသည်။"

We look at a summary of the distribution of the number of characters per sentence, and at its percentiles at 5% intervals.

summary(nchar(senP_xdocTNS.t))

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0    57.0    88.0   109.1   135.0 14686.0

quantile(nchar(senP_xdocTNS.t), probs = seq(0, 1, .05), type = 7)

   0%    5%   10%   15%   20%   25%   30%   35%   40%   45%   50%   55%   60%   65%   70% 
    1    16    32    43    51    57    64    69    75    82    88    96   104   113   123 
  75%   80%   85%   90%   95%  100% 
  135   149   168   196   251 14686

Roughly speaking, I guess a longer “sentence” is more likely to be a genuine sentence than a shorter one. So I am going to play safe by dropping about half of the “sentences”, that is, by keeping only those with 90 characters or more. You may like to choose a different cut-off point, or use a different strategy altogether.

nc <- nchar(senP_xdocTNS.t)
senP_nc90 <- senP_xdocTNS.t[which(nc >= 90)]
length(senP_nc90)

[1] 250291
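
To compare alternative cut-off points before committing to one, it may help to tabulate how many sentences each threshold would keep. A minimal sketch, using a made-up vector of lengths rather than the real corpus (toy_nc, cutoffs and kept are hypothetical names):

```r
# Made-up sentence lengths in characters (stand-in for nchar(senP_xdocTNS.t))
toy_nc <- c(1, 16, 32, 57, 88, 90, 135, 196, 251, 14686)

# Candidate cut-off points to compare
cutoffs <- c(30, 60, 90, 120)

# For each cut-off: number and share of sentences at or above it
kept <- data.frame(
  cutoff = cutoffs,
  n_kept = sapply(cutoffs, function(k) sum(toy_nc >= k)),
  share  = sapply(cutoffs, function(k) mean(toy_nc >= k))
)
print(kept)
```

On the real data, toy_nc would be replaced by nc from the step above. The percentile table already shows a median of 88 characters, which is why a cut-off of 90 keeps roughly the upper half.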

We read a random sample of 10 sentences.

set.seed(424)
utf8::utf8_print(senP_nc90[sample(1:length(senP_nc90),10)])

 [1] "မိန်နက္ခတာရာအလင်းနှစ်ဂျွန်ဟာရှယ်သည်မိန်နက္ခတာရာရှိနှစ်ဘက်ခုံးဂလက်ဆီဖြစ်ပြီးနေအဖွဲ့အစည်းမှအလင်းနှစ်၁၈၈မီလီယံကွာဝေးသည်။"      
 [2] "မြန်မာနိုင်ငံတွင်နိုင်ငံတည်ဆောက်ရေးကိုအမှန်တကယ်အာရုံစူးစိုက်စတင်တော့မည်ဟုဆိုလျှင်မြန်မာနိုင်ငံ၏နိုင်ငံတည်ဆောက်ရေးဆိုင်ရာမူဝါဒများကိုဝို…"
 [3] "ပို၍တိကျစွာဆိုရသော်တရုတ်နိုင်ငံ၏တောင်ဘက်၊အိန္ဒိယနိုင်ငံ၏အရှေ့ဘက်နှင့်ဩစတေးလျနိုင်ငံ၏မြောက်ဘက်တို့တွင်တည်ရှိသည်။"                 
 [4] "၁၉၄၇ခု၊ဇူလိုင်လ၁၉ရက်၊စနေနေ့၊နံနက်၁ဝနာရီ၄၅မိနစ်တွင်အာဇာနည်ခေါင်းဆောင်ကြီးများကိုလုပ်ကြံသောမသမာသူတို့၏လက်ချက်ဖြင့်ကျဆုံ…"
 [5] "မြန်မာနိုင်ငံတော်ဗဟိုဘဏ်၏လက်ရှိအတိုးနှုန်းမှတစ်နှစ်လျှင်၁၀ဖြစ်ပြီး၊အပ်ငွေအဖြစ်အနည်းဆုံးအတိုးနှုန်းမှာတစ်နှစ်လျှင်၈နှင့်ချေးငွေအပေါ်…"
 [6] "သံမဏိလုပ်ငန်းတိုးတက်လာပြီးသည့်နောက်တွင်၁၉ဝဝပြည့်နှစ်နောက်ပိုင်းမှစ၍ရှက်ဖီးမြို့၏နယ်နိမိတ်ကိုတိုးချဲ့လာရသည်။"                  
 [7] "သူ့တသက်တွင်နိုင်ငံတော်နှင့်အစိုးရအကြီးအကဲစသည့်ရာထူးများကိုမရယူခဲ့သော်လည်းလက်တွေ့တွင်မူ၁၉၇၈ခုနှစ်မှ၁၉၉၀ပြည့်လွန်နှစ်များအထိတရုတ်ပြည်…"
 [8] "၂၀၁၀ခုနှစ်၊မတ်လတွင်ပါတီကိုထပ်မံမှတ်ပုံတင်ရန်နှင့်ရွေးကောက်ပွဲဝင်သင့်ကြောင်းကိုဦးဆောင်ဆွေးနွေးခဲ့သည်။"                     
 [9] "လီနင်ဂရက်မြို့တွင်ကျယ်ပြန့်သောရိပ်သာများ၊နန်းတော်နှင့်ပြည်သူပိုင်အဆောက်အအုံများကြောင့်ထင်ရှားသည်။"                   
[10] "အထက်မြန်မာပြည်နှင့်ရှမ်းကုန်းပြင်မြင့်ဒေသတွင်အခြေချသည့်တရုတ်လူမျိုးအများစုမှာယူနန်ပြည်နယ်ဘက်မှလာသည့်ယူနန်လူမျိုး၊ပန်းသေးမျ…"

I am calling this collection of 250,291 Myanmar Wikipedia sentences myWiki-QDC2 and saving it to the text file myWiki-QDC2.txt (119MB), which is available for download here.

writeLines(senP_nc90, con = "myWiki-QDC2.txt", useBytes = TRUE)
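
When reading such a file back, it is safest to declare the encoding so that R marks the strings as UTF-8, particularly on systems whose native encoding is not UTF-8. The sketch below round-trips a small made-up vector through a temporary file (toy_sentences, tmp and back are hypothetical names); for the real corpus the path would be myWiki-QDC2.txt.

```r
# Two toy "sentences", each ending in U+104B
toy_sentences <- c("\u1000\u102d\u102f\u104b", "\u1001\u102b\u104b")

# Convert explicitly to UTF-8, then write the bytes unchanged
tmp <- tempfile(fileext = ".txt")
writeLines(enc2utf8(toy_sentences), con = tmp, useBytes = TRUE)

# Read back, declaring the encoding of the file
back <- readLines(tmp, encoding = "UTF-8")

print(length(back))
print(all(back == toy_sentences))
unlink(tmp)
```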

Wednesday, April 24, 2019
