Bayanathi Technology: Cycle2: Quick and Dirty Corpus-3

The more extensive cleaning that I tried out and reported in my last post is now applied to the whole of xdoc_rb_mySenY_gt5.

x100 <- xdoc_rb_mySenY_gt5
library(stringr)
library(quanteda)

system.time({
# remove infobox
  x100_iBoxN <- sub("\\{\\{Infobox(.+\n)+\\}\\}","",x100)

# remove all section titles; remove bulleted-list lines that contain Myanmar digits (\u1040-\u1049)

  x100_iBoxN_titBulN <- str_remove_all(x100_iBoxN, regex("^==*.+==*.*\n", multiline = TRUE)) %>%
    str_remove_all(., regex("^\\*.*[\u1040-\u1049]+.+\n", multiline = TRUE))

# breaking up into sentences
  x100_itN_sen <- char_segment(x100_iBoxN_titBulN, pattern = "\u104b", valuetype = "regex", pattern_position = "after")

# ratio of English to Myanmar characters
  en.n <- str_count(x100_itN_sen, pattern = "[a-zA-Z]")
  my.n <- str_count(x100_itN_sen, pattern = "[\u1000-\u104f]")
  emR.pc <- round(en.n*100/my.n, 1)

# keep sentences with "emR.pc" of less than 50%; remove *file* and *image* references
  x100_itN_sen.1 <- x100_itN_sen[which(emR.pc<50)] %>%
    gsub("\\[\\[File:.+\\]\\]","",.) %>%
    gsub("\\[\\[Image:.+\\|","",.)

# remove the hyperlink markers and other unnecessary characters
  x100_itN_sen.2 <- gsub("\\[\\[[^\u1000-\u104f]+\\]\\]","",x100_itN_sen.1) %>%
    gsub("\\[","",.) %>%
    gsub("\\]","",.) %>%
    gsub("['|]","",.) %>%
    gsub("\\(\\{\\{.+\\}\\}\\)","",.) %>%
    gsub("\\{\\{[A-Za-z]+ .*[A-Za-z]+\\}\\}\n+", "", .) %>%
    gsub("<.*>.+", "", .) %>%
    gsub("\n","", .) %>%
    gsub("[A-Za-z]+\\{+.+\\}+","", .) %>%  # added
    gsub("\\{+.+\\}+","", .) %>%           # added
    gsub("^[\\#]","", .) %>%               # added
    .[which(nchar(.)>50)]
})

   user  system elapsed 
 118.95    0.96  121.04
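
The ratio filter in the middle of that pipeline is easy to get wrong at the edges, so here is a minimal sketch of it on toy data. It uses base R only; the vector `sen` and the helper `count_range()` are hypothetical names for illustration and are not part of the original script, which uses `str_count()`. The point is to show what happens when a segment has no Myanmar characters at all: the ratio becomes `Inf` or `NaN`, and `which()` quietly drops both.

```r
# Hypothetical toy segments standing in for x100_itN_sen.
# \u escapes keep the example independent of the file encoding.
sen <- c(
  "\u1019\u103c\u1014\u103a\u1019\u102c",               # Myanmar only
  "Yangon \u101b\u1014\u103a\u1000\u102f\u1014\u103a",  # mixed English and Myanmar
  "File:Example.jpg",                                   # English only
  "12345"                                               # neither script
)

# Count the code points of each string that fall inside [lo, hi]
count_range <- function(x, lo, hi) {
  vapply(x, function(s) {
    cp <- utf8ToInt(s)
    sum(cp >= lo & cp <= hi)
  }, integer(1), USE.NAMES = FALSE)
}

en_n <- count_range(sen, 65L, 90L) + count_range(sen, 97L, 122L)  # A-Z and a-z
my_n <- count_range(sen, 0x1000L, 0x104FL)                        # same range as [\u1000-\u104f]

# Percentage of English letters relative to Myanmar characters
emR_pc <- round(en_n * 100 / my_n, 1)

# which() ignores NA/NaN comparisons, so segments without any Myanmar
# characters (ratio Inf or NaN) are dropped along with the English-heavy ones
keep <- which(emR_pc < 50)
sen_kept <- sen[keep]
```

Indexing with the bare logical vector `emR_pc < 50` instead of `which()` would insert an `NA` element for the `NaN` case, which is one reason to keep the `which()` call.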

length(x100_itN_sen.2)

[1] 320594

Randomly sampling sentences and looking for problems showed that some sentences begin with a punctuation mark, that is, one of the characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. We'll check for them and write them out to a text file.

x100_itN_sen.2[which(grepl("^[[:punct:]]",x100_itN_sen.2))] %>%
  writeLines(.,con="x100_punctBegin.txt", useBytes = TRUE)

The file contained 12,231 sentences. Since that is not too many, I decided to remove them, which left me with 308,363 sentences.

x100_itN_sen.3 <- x100_itN_sen.2[which(grepl("^[[:punct:]]",x100_itN_sen.2)==FALSE)]
length(x100_itN_sen.3)

[1] 308363
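
To make clear what this filter throws away, here is a small sketch on made-up strings (the vector `sen` is hypothetical, not corpus data). It uses `!grepl(...)`, which selects the same elements as `which(grepl(...) == FALSE)`.

```r
# Hypothetical toy sentences; only the first character matters here
sen <- c(
  "\"A sentence that opens with a quotation mark",
  "(a sentence that opens with a parenthesis",
  "A plain sentence",
  "* a leftover bullet line"
)

# TRUE where the sentence begins with an ASCII punctuation character
starts_punct <- grepl("^[[:punct:]]", sen)

dropped <- sen[starts_punct]   # what would be written to the text file
kept    <- sen[!starts_punct]  # what stays in the corpus
```

Note that a sentence opening with a quotation mark is dropped together with the markup leftovers, because `"` is a member of `[[:punct:]]`.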

Then I add the sentence boundary mark (U+104B) to the end of each sentence to complete it.

x100_itN_sen.4 <- paste0(x100_itN_sen.3,"\u104b")

At this point I thought I was done for good! Then I felt uneasy about the quotation marks in the sentences. What if I had deleted them where they should have been left alone? So I checked whether the quotation marks in each sentence come in pairs (an even number of them). Casually looking through some of the sentences with an odd count, I hit on a passage quoting reviews of an accomplished writer on the life of a fisherman.

kpat <- str_count(x100_itN_sen.4, pattern = "[\"]")
table(kpat %% 2 != 0)


 FALSE   TRUE 
307935   1070

cat(x100_itN_sen.4[which(kpat %% 2 != 0)][225:229])

တံငါ ဘဝ သရုပ်ဖော် ဝတ္ထုများတွင် မြန်မာ စာပေ၌ ကြယ်နီသည် မီးရှူး တန်ဆောင် ဖြစ်ခဲ့သည်" ဟု ကေတု ဘုန်းမော်က မှတ်ချက် ပြုသည်။ သရုပ်ဖော် စာပေ ရေးသား ပြုစုရာမှာ များစွာ ပါရမီ ရင့်သန်တဲ့ စာရေး ဆရာ ကြယ်နီ" ဟု သိန်းဖေမြင့်က ချီးကျူး ခဲ့သည်။ တရွေးကို ဆယ်စိတ် စိတ်၍ ရေးသား ခဲ့သော ဆရာ ကြယ်နီ" ဟု တင့်တယ်က ထောမနာ ပြုခဲ့သည်။ ပညာပေး သက်သက် မဟုတ်ဘဲ အနုပညာ ရသ မြောက်အောင် ရေးကြရာတွင် ရေလုပ်သား ဘဝ တံငါ ဝတ္ထု၌ ကြယ်နီသည် ... စာဖတ် ပရိသတ်၏ စိတ်နှလုံးကို ယူကျုံး နိုင်သည် အထိ ကလောင်စွမ်း ထက်လှသည်" ဟူ၍ စာသုသီက သုံးသပ် ပြခဲ့သည်။ စာပေ ဂန္တဝင်နယ်၌ ကြယ်နီနီ ကလေးများ တဖျတ်ဖျတ် တောက်ပလျက်ကား ကျန်ရှိ ရစ်ခဲ့ ပါသည်" ဟူ၍ မင်းဏီက ရေးခဲ့သည်။

From the above it is clear that the opening quotation mark was legitimately placed at the beginning of the first sentence and that the quotation closed in the middle of the next sentence. And I had wrongly deleted the opening quotation mark! Maybe it would be fine if I left the opening quotation mark alone. Or would it?

It would not be! Because:

The example is from the page x100[1612] on the Myanmar writer U Latt. Because the text is broken up into individual sentences, the sentence outlined in red keeps the opening quotation mark, and the next sentence (outlined in green) artificially acquires the first one's closing quotation mark as its own opening quotation mark.
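
The effect can be reproduced on a made-up paragraph. This is only a sketch in base R: the Latin placeholders AAA, BBB, and so on stand in for Myanmar text, and `para` and `count_quotes()` are hypothetical names. It uses `strsplit()` rather than `char_segment()`, on the assumption that both drop the boundary mark.

```r
# Hypothetical paragraph: a quotation that opens at the start of the first
# sentence and closes in the middle of the second one
para <- "\"AAA\u104b BBB\" CCC\u104b DDD\u104b"

# Number of double quotation marks in each element of x
count_quotes <- function(x) {
  lengths(regmatches(x, gregexpr("\"", x, fixed = TRUE)))
}

# Split on the sentence boundary mark (the mark itself is dropped)
sen <- trimws(strsplit(para, "\u104b", fixed = TRUE)[[1]])

n_para <- count_quotes(para)  # quotation marks in the whole paragraph
n_sen  <- count_quotes(sen)   # quotation marks per sentence

# Applying the punctuation filter removes the sentence holding the opening mark
survivors <- sen[!grepl("^[[:punct:]]", sen)]

# Put the boundary mark back, as done for x100_itN_sen.4
survivors_done <- paste0(survivors, "\u104b")
```

The paragraph has an even number of quotation marks, but after splitting, each of the first two sentences has an odd number. Once the sentence that begins with the opening mark is filtered out, the second sentence is left with a dangling closing mark.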

So the easiest and most correct solution would be not to break the text up into sentences, but to leave the paragraphs as they are. However, exactly the same difficulty can arise in English, and I don't know how it is handled there. With my limited knowledge, I have seen punctuation and the like removed from (English) corpora before going on with NLP work, and I don't know whether that is applicable here.
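
As a rough sketch of what working at paragraph level could look like, the following base R snippet splits a made-up page on newlines and checks the quotation marks per paragraph instead of per sentence. Everything here is an assumption for illustration: `page` is toy data with Latin placeholders, and splitting on "\n" is only one possible definition of a paragraph.

```r
# Hypothetical page with three paragraphs separated by newlines
page <- paste0(
  "\"AAA\u104b BBB\" CCC\u104b\n",  # quotation spans two sentences
  "DDD\u104b EEE\u104b\n",          # no quotation marks
  "\"FFF\u104b"                     # genuinely unbalanced
)

# Keep paragraphs whole instead of segmenting into sentences
paras <- strsplit(page, "\n+")[[1]]

# Count double quotation marks per paragraph
n_quote <- lengths(regmatches(paras, gregexpr("\"", paras, fixed = TRUE)))

# Paragraphs whose quotation marks pair up
balanced <- n_quote %% 2 == 0

paras_ok    <- paras[balanced]
paras_check <- paras[!balanced]  # left over for manual inspection
```

In this toy case the quotation that spans two sentences stays balanced, and only the paragraph with a real missing mark is flagged.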

Monday, May 20, 2019
