More extensive cleaning, as tried out and reported in my last post, is now applied to the whole of xdoc_rb_mySenY_gt5.
x100 <- xdoc_rb_mySenY_gt5
library(stringr)
library(quanteda)
system.time({
# remove infobox
x100_iBoxN <- sub("\\{\\{Infobox(.+\n)+\\}\\}","",x100)
# remove all titles and bulleted lists
x100_iBoxN_titBulN <- str_remove_all(x100_iBoxN, regex("^==*.+==*.*\n", multiline=T)) %>%
str_remove_all(., regex("^\\*.*[\u1040-\u1049]+.+\n", multiline = T))
# breaking up into sentences
x100_itN_sen <- char_segment(x100_iBoxN_titBulN, pattern = "\u104b", valuetype = "regex", pattern_position = "after")
# ratio of English to Myanmar characters
en.n <- str_count(x100_itN_sen, pattern = "[a-zA-Z]")
my.n <- str_count(x100_itN_sen, pattern = "[\u1000-\u104f]")
emR.pc <- round(en.n*100/my.n, 1)
# keep sentences with "emR.pc" of less than 50%; remove *file* and *image* references
x100_itN_sen.1 <- x100_itN_sen[which(emR.pc<50)] %>%
gsub("\\[\\[File:.+\\]\\]","",.) %>%
gsub("\\[\\[Image:.+\\|","",.)
# remove the hyperlink markers and other unnecessary characters
x100_itN_sen.2 <- gsub("\\[\\[[^\u1000-\u104f]+\\]\\]","",x100_itN_sen.1) %>%
gsub("\\[","",.) %>%
gsub("\\]","",.) %>%
gsub("['|]","",.) %>%
gsub("\\(\\{\\{.+\\}\\}\\)","",.) %>%
gsub("\\{\\{[A-Za-z]+ .*[A-Za-z]+\\}\\}\n+", "", .) %>%
gsub("<.*>.+", "", .) %>%
gsub("\n","", .) %>%
gsub("[A-Za-z]+\\{+.+\\}+","", .) %>% # added
gsub("\\{+.+\\}+","", .) %>% # added
gsub("^[\\#]","", .) %>% # added
.[which(nchar(.)>50)]
})
user system elapsed
118.95 0.96 121.04
length(x100_itN_sen.2)
[1] 320594
Randomly picking out sentences and looking for problems showed that there were sentences beginning with one of the punctuation characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~. We’ll check for them and write them out to a text file.
x100_itN_sen.2[which(grepl("^[[:punct:]]",x100_itN_sen.2))] %>%
writeLines(.,con="x100_punctBegin.txt", useBytes = TRUE)
There were 12,231 such sentences; since they are not too many, I decided to remove them, leaving 308,363 sentences.
x100_itN_sen.3 <- x100_itN_sen.2[which(grepl("^[[:punct:]]",x100_itN_sen.2)==FALSE)]
length(x100_itN_sen.3)
[1] 308363
Then I add the sentence boundary mark (။, U+104B) back to complete the sentences.
x100_itN_sen.4 <- paste0(x100_itN_sen.3,"\u104b")
Now I thought it was done for good! Then I felt uneasy about the quotation marks in the sentences. What if I had deleted them where they needed to be left alone? So I now check whether each sentence contains them in pairs (an even number of them). Casually picking through a number of them, I hit on a sentence quoting a review of an accomplished writer on the life of a fisherman.
kpat <- str_count(x100_itN_sen.4, pattern = "[\"]")
table(kpat %% 2 != 0)
FALSE TRUE
307935 1070
cat(x100_itN_sen.4[which(kpat %% 2 != 0)][225:229])
တံငါ ဘဝ သရုပ်ဖော် ဝတ္ထုများတွင် မြန်မာ စာပေ၌ ကြယ်နီသည် မီးရှူး တန်ဆောင် ဖြစ်ခဲ့သည်" ဟု ကေတု ဘုန်းမော်က မှတ်ချက် ပြုသည်။ သရုပ်ဖော် စာပေ ရေးသား ပြုစုရာမှာ များစွာ ပါရမီ ရင့်သန်တဲ့ စာရေး ဆရာ ကြယ်နီ" ဟု သိန်းဖေမြင့်က ချီးကျူး ခဲ့သည်။ တရွေးကို ဆယ်စိတ် စိတ်၍ ရေးသား ခဲ့သော ဆရာ ကြယ်နီ" ဟု တင့်တယ်က ထောမနာ ပြုခဲ့သည်။ ပညာပေး သက်သက် မဟုတ်ဘဲ အနုပညာ ရသ မြောက်အောင် ရေးကြရာတွင် ရေလုပ်သား ဘဝ တံငါ ဝတ္ထု၌ ကြယ်နီသည် ... စာဖတ် ပရိသတ်၏ စိတ်နှလုံးကို ယူကျုံး နိုင်သည် အထိ ကလောင်စွမ်း ထက်လှသည်" ဟူ၍ စာသုသီက သုံးသပ် ပြခဲ့သည်။ စာပေ ဂန္တဝင်နယ်၌ ကြယ်နီနီ ကလေးများ တဖျတ်ဖျတ် တောက်ပလျက်ကား ကျန်ရှိ ရစ်ခဲ့ ပါသည်" ဟူ၍ မင်းဏီက ရေးခဲ့သည်။
From the above it is clear that the opening quote mark was legitimately placed at the beginning of the first sentence, and the quotation ended in the middle of the next sentence. And I had wrongly deleted the first quote mark! Maybe if I leave the opening quote mark alone, it would be fine. Or would it?
It would not be! Here is why. The passage is from the page x100[1612], on the Myanmar writer U Latt. Because the text was broken up into individual sentences, the sentence outlined in red (in the screenshot) kept the opening quote mark, while the next sentence, outlined in green, artificially acquired the first one's closing quote mark as its apparent opening quote mark.
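The quote-splitting effect can be reproduced with a toy English paragraph (hypothetical data, using stringr only; the split on the full stop stands in for the split on ။ above):

```r
library(stringr)

# toy paragraph: a quotation opens in the first sentence and closes in the second
para <- 'He said "this is fine. It really is" and left. Then he smiled.'

# split into sentences after each full stop (analogous to splitting after \u104b)
sen <- str_trim(unlist(str_split(para, "(?<=\\.)")))
sen <- sen[sen != ""]

# the paragraph as a whole is balanced (an even number of quote marks) ...
str_count(para, '"') %% 2 == 0   # TRUE

# ... but each of the first two sentences is left holding a single quote mark
str_count(sen, '"')              # 1 1 0
```

So any per-sentence check on quote counts will flag both halves of a split quotation, exactly as in the writer-review passage above.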
So the easiest, and the correct, solution would be not to break up the sentences but to leave the paragraphs as they are. However, exactly the same difficulty would arise in English, and I don’t know how it is usually handled there. With my limited knowledge, I have seen punctuation and the like removed from corpora before going on with NLP work (in English), and I don’t know whether that is applicable here.
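If sentence-level segments are still wanted, one possible compromise (a sketch only, not what was done above, and again on hypothetical toy data) is to strip the straight double-quote marks from the paragraphs before segmenting, so that no sentence can be left holding half of a quotation pair:

```r
library(stringr)

# toy paragraph with a quotation spanning the sentence boundary
para <- 'He said "this is fine. It really is" and left.'

# drop the double-quote marks first, then segment on the full stop
para_noq <- str_remove_all(para, '"')
sen <- str_trim(unlist(str_split(para_noq, "(?<=\\.)")))
sen <- sen[sen != ""]
sen
# "He said this is fine."   "It really is and left."
```

This of course loses the quotation structure entirely, which is the trade-off the paragraph-level approach avoids.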