I used the following code to extract the text from the Wikipedia dump data on which to base my current round of work on a Myanmar Wikipedia corpus.
library(xml2)
system.time(
  xdoc <- read_xml("mywiki-20190201-pages-articles.xml", encoding = "UTF-8")
)
xml_ns(xdoc)
# find nodesets for robot translated articles
botTrans <- readLines("botTranlateWiki Apr20_2019.txt", encoding = "UTF-8")
nodes2rm.1 <- list()
text2rm.1 <- list()
L <- length(botTrans)
system.time(
  for (i in 1:L) {
    tmp <- xml_find_all(xdoc, paste0("//d1:page[./d1:title = '", botTrans[i], "']"))
    nodes2rm.1[[i]] <- tmp
  }
)
nodes2rm.2 <- xml_find_all(xdoc, "//d1:page[./d1:title = 'ဘိုးအင်း ၇၇၇' or ./d1:title = 'Vrahovice' or ./d1:title = 'Alaska Thunderfuck' or ./d1:title = 'Abir']")
# remove robot translated nodesets
lapply(nodes2rm.1, xml_remove)
lapply(nodes2rm.2, xml_remove)
This resulted in the xdoc object, which represents the original Wikipedia dump XML with the robot-translated articles removed.
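As a quick sanity check, a sketch rather than part of the original session, the number of page nodes left in xdoc after the removal could be counted:
# count the <page> nodes remaining after xml_remove(); the exact figure
# depends on how many of the bot-translated titles were actually matched
length(xml_find_all(xdoc, "//d1:page"))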
At the beginning, I used a naive, crude approach to extract text from the Wikipedia dump file, that is, using regular expression matching to remove all characters that are not Myanmar Unicode. At that time I had no idea at all of XML or of how the Myanmar Wikipedia is organized. Then I got to know a bit about both. Now I am beginning to see that the right way to go is, firstly, to extract from the original XML file the nodesets that come as close as possible to the body text of the articles. This is best done with XML tools, and I am using the xml2 R package. As I have come to know, the various subjects of the Myanmar Wikipedia are organized as pages, and those pages also contain text we don’t need. So, to clean up the collection of text I extracted, I removed the pages translated by robots.
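One detail worth spelling out is the d1: prefix in the XPath queries above. The MediaWiki dump declares a default namespace, and xml_ns() assigns generated prefixes (d1, d2, …) to the namespaces it finds, so every xml_find_all() query has to qualify element names with that prefix. A minimal sketch of inspecting this, with the exact namespace URI depending on the dump version:
ns <- xml_ns(xdoc)   # named vector of namespaces; the default one shows up as d1
ns
# element names must then be qualified with the generated prefix, for example:
head(xml_text(xml_find_all(xdoc, "//d1:title", ns)))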
To clean the pages further, I collected the text nodesets from xdoc and then converted them into a character vector with the xml_text() function. The resulting character vector lets me use standard R tools on it with ease. Now I have a character vector of length 71,392.
system.time({
  # extract all <text> nodes, i.e. the wikitext body of every remaining page
  xdoc_bt <- xml_find_all(xdoc, "//d1:text")
  xdoc_bt.v <- xml_text(xdoc_bt)
})
user system elapsed
6.79 0.11 6.94
str(xdoc_bt.v)
chr [1:71392] "#REDIRECT [[<U+1017><U+101F><U+102D><U+102F><U+1005><U+102C><U+1019><U+103B><U+1000><U+103A><U+1014><U+103E><U+102C>]]" ...
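The <U+....> escape codes in the str() output are only a display artifact: on some locales, notably Windows, print() and str() show characters the console cannot represent as Unicode escapes. cat(), or writing to a UTF-8 file as done later with writeLines(), shows the actual Myanmar text:
cat(xdoc_bt.v[1])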
Then my casual viewing of pages showed me that there are #REDIRECT and #Redirect pages. I was not sure whether they are different, so I tried to find out:
length(xdoc_bt.v[which(grepl("#Redirect", xdoc_bt.v) & grepl("#REDIRECT", xdoc_bt.v))])
[1] 0
length(xdoc_bt.v[which(grepl("#Redirect", xdoc_bt.v))])
[1] 731
length(xdoc_bt.v[which(grepl("#REDIRECT", xdoc_bt.v))])
[1] 8458
length(xdoc_bt.v[which(grepl("#Redirect", xdoc_bt.v) | grepl("#REDIRECT", xdoc_bt.v))])
[1] 9189
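Since the two spellings never occur together, a single case-insensitive test is an equivalent, slightly more defensive sketch; it would also catch any other capitalisation such as #redirect:
sum(grepl("#redirect", xdoc_bt.v, ignore.case = TRUE))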
Looking at their text, it seems the majority of them have no article content of their own but merely point to some other page, so I removed them. This removal does incur some loss; for example, the Declaration of Independence of Myanmar is among the pages dropped.
xdoc_rdrN <- xdoc_bt.v[which(!(grepl("#Redirect", xdoc_bt.v) | grepl("#REDIRECT", xdoc_bt.v)))]
length(xdoc_rdrN)
[1] 62203
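If that loss matters, a hedged alternative, not the route taken here, would be to drop only the redirect pages that contain no Myanmar sentence-ending mark ။ (U+104B, the same test used further below), so redirect entries that still carry real text would survive:
isRedirect <- grepl("#redirect", xdoc_bt.v, ignore.case = TRUE)
hasMySentence <- grepl("\u104b", xdoc_bt.v)   # ။ sentence-final mark
xdoc_rdrN_alt <- xdoc_bt.v[!isRedirect | hasMySentence]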
cat(xdoc_rdrN[30000])
'''ဘိုကုန်းရွာ၊ ဘိုင်းဒေါင့်ချောင်း'''
{{Infobox settlement
|official_name = ဘိုကုန်း
|pushpin_label_position = bottom
|pushpin_map = မြန်မာနိုင်ငံ
|pushpin_map_caption = ဘိုကုန်း တည်နေရာ၊ မြန်မာ။
|pushpin_mapsize = 300
|subdivision_type = နိုင်ငံ
|subdivision_name = {{flag|မြန်မာနိုင်ငံ}}
|subdivision_type1 = [[မြန်မာနိုင်ငံ တိုင်းဒေသကြီးများ|တိုင်းဒေသကြီး]]
|subdivision_name1 = [[ဧရာဝတီတိုင်းဒေသကြီး]]
|subdivision_type2 = [[မြန်မာနိုင်ငံ ခရိုင်များ|ခရိုင်]]
|subdivision_name2 = [[လပွတ္တာခရိုင်]]
|subdivision_type3 = [[မြန်မာနိုင်ငံ မြို့နယ်များ|မြို့နယ်]]
|subdivision_name3 = [[လပွတ္တာမြို့နယ်]]
|subdivision_type4 = [[ကျေးရွာအုပ်စု]]
|subdivision_name4 =ဘိုင်းဒေါင့်ချောင်း<ref>GAD, Feb 2011</ref>
|latNS = N
|latd = 16.04450
|longEW = E
|longd = 94.76801
|P-code = 150719
}}
==ကိုးကား==
<references/>
[[Category:မြန်မာနိုင်ငံ ရွာများ]]
[[Category:ဧရာဝတီတိုင်းဒေသကြီးရှိ ရွာများ]]
[[Category:BotUpload]]
This suggests that I could also remove pages tagged with “[[Category:BotUpload]]”. I guess it marks pages uploaded by a robot, and it seems unlikely that they contain many Myanmar sentences. Also, there are {{stub}} pages that still need expansion and improvement; they look like a standard feature of Wikipedia. I am removing both categories of pages.
xdoc_rdrN_bupStubN <- xdoc_rdrN[which(!(grepl("\\[\\[Category:BotUpload\\]\\]", xdoc_rdrN) | grepl("stub\\}\\}", xdoc_rdrN)))]
length(xdoc_rdrN_bupStubN)
[1] 37808
There were also English-only pages, and pages that include Myanmar characters but no complete sentence. These could be removed by keeping only the pages that contain at least one Myanmar sentence, detected here by the presence of the Myanmar sentence-final mark ။ (U+104B).
table(grepl("\u104b", xdoc_rdrN_bupStubN))
FALSE TRUE
16678 21130
xdoc_rb_mySenY <- xdoc_rdrN_bupStubN[which(grepl("\u104b", xdoc_rdrN_bupStubN))]
length(xdoc_rb_mySenY)
[1] 21130
After this series of eliminations, we are left with 21,130 pages containing one or more Myanmar sentences, not translated or uploaded by robots, and not classed as stubs.
It would now be interesting to find the number of sentences in each page by counting the Myanmar sentence boundary marks with the str_count() function of the stringr package. The summary reveals the astonishing fact that one page contains 5,767 Myanmar sentences!
nsenMark <- stringr::str_count(xdoc_rb_mySenY,"\u104b")
summary(nsenMark)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 8.00 22.27 23.00 5767.00
maxMySen <- xdoc_rb_mySenY[which(nsenMark == max(nsenMark))]
writeLines(maxMySen, "maxMySen.txt", useBytes = TRUE)
We arbitrarily keep only those pages with more than five Myanmar sentences.
xdoc_rb_mySenY_gt5 <- xdoc_rb_mySenY[which(nsenMark > 5)]
length(xdoc_rb_mySenY_gt5)
[1] 12332
nsenMark.1 <- stringr::str_count(xdoc_rb_mySenY_gt5,"\u104b")
summary(nsenMark.1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 12.00 20.00 36.53 39.00 5767.00
sum(nsenMark.1)
[1] 450435
We ended up with 12,332 pages, each containing six or more Myanmar sentences, for a total of 450,435 sentences. This is clearly not the end; there is more cleanup to do, such as the section titles you can see in the sample page above.
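As one example of that further cleanup, wikitext section headings such as ==ကိုးကား== could be stripped with a regular expression before moving on to sentence extraction; this is only a rough sketch of a possible next step, not something done above:
# strip "== heading ==" style wikitext section titles
xdoc_rb_mySenY_gt5_clean <- gsub("={2,}[^=\n]*={2,}", "", xdoc_rb_mySenY_gt5)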