Thursday, April 4, 2019

A quick and dirty Myanmar Wikipedia corpus


My raw Myanmar Wikipedia corpus, when written out, has 1.3 million lines with a total length of 209.9 million characters. My last post looked at some practical aspects of cleaning this corpus manually. In that process it became clear to me that members of the imagined “crowd” who would handle the tidying up of the raw corpus could have a hard time.
To start with, identifying the boundaries of the body text of the 44,000-plus articles could be a problem! I had looked at the article-titles file available from the same Wikimedia dump that was the source of my raw corpus, but those titles were arranged in alphabetical order rather than in the order in which the articles appear in the corpus source. It was easy enough to generate the titles directly from the XML source file (mywiki-20190201-pages-articles.xml) so that the order of titles and article text agreed. But those titles cover not only articles but also graphs, data tables, pictures, and so on, and, as noted in my last post, it was not easy to tell which of them was the proper title of an article. In the end, the list of titles wasn’t of much help.
In the following screenshots, the one on the left shows the list of titles from the file “mywiki-20190201-all-titles.gz” downloaded from the Wikimedia dump. The one on the right was generated by running the XMLStarlet tool from within R:
# Use XMLStarlet to select the value of every <title> element in the dump
titles <- system('d:/0_starlet/xml sel -t -v "//_:title" mywiki-20190201-pages-articles.xml', intern = TRUE)
# Write the titles out, one per line
writeLines(titles, con = "myWikiTitle.txt", useBytes = TRUE)
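If you prefer to stay entirely inside R, roughly the same extraction can be done with the xml2 package (a sketch under my own assumptions, offered as an alternative to XMLStarlet; note that reading the full dump this way can take a lot of memory):
library(xml2)
# Read the dump; the MediaWiki export format uses a default namespace,
# which xml_ns() binds to the prefix "d1"
doc <- read_xml("mywiki-20190201-pages-articles.xml")
ns  <- xml_ns(doc)
titles2 <- xml_text(xml_find_all(doc, "//d1:title", ns))
writeLines(titles2, con = "myWikiTitle.txt", useBytes = TRUE)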
So I thought: what we are trying to do is get a big corpus out of Myanmar Wikipedia articles, right? It won’t matter if we can’t create a corpus containing the complete text of each and every article. In that case I could simply extract complete paragraphs or sentences with R, which would be the easy way out. After that I would describe my product as the result of citizen coding (performed by a dummy), to remind everyone of its possible imperfections.
Let me call this corpus myWiki-QDcorpus. Here is the story.
First, I retrieved my raw text data sets which I had placed in the public domain.
load("mywiki-20190201-textDataSets.rda")
ls()
[1] "textMy"    "textMyNbl" "textU"    
The R object “textMyNbl” is a vector. Each of its elements is a string of text created from the corresponding line of text in the original Wikimedia dump XML file. Here are some characteristics of this vector. It has a length of 1,294,632 elements, which could be written out as that many lines to a text file.
str(textMyNbl)
 chr [1:1294632] "<U+101D><U+102E><U+1000><U+102E><U+1015><U+102E><U+1038><U+1012><U+102E><U+1038><U+101A><U+102C><U+1038>" ...
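For reference, writing that raw vector out, one element per line, would look something like this (just a sketch; the file name here is my own choice, not part of the original workflow):
# Write the raw vector to a text file, one element per line (1,294,632 lines)
writeLines(textMyNbl, con = "myWiki-rawLines.txt", useBytes = TRUE)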
You can get the number of characters in a line of text with the nchar() function. For a string of Myanmar-language text, a character is equivalent to a Unicode code point. Here is a summary of the number of characters in each element of the vector:
x <- nchar(textMyNbl)
length(textMyNbl)
[1] 1294632
sum(x)
[1] 69102916
summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    1.00    12.00    23.00    53.38    46.00 10087.00 
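As a quick illustration of the code-point remark above (my own toy example, not from the original workflow), nchar() counts Unicode code points by default, so a Myanmar consonant plus its vowel sign counts as two characters:
nchar("\u101D\u102E")   # letter WA (U+101D) + vowel sign II (U+102E)
[1] 2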
You can look at the distribution of the number of characters per element, that is, the distribution of Unicode code points per line/paragraph of text when the vector is written out to a text file.
plot(density(x))
polygon(density(x), col="red", border="blue")
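Because the maximum line length (10,087 characters) lies far out in the right tail, the linear-scale density plot is hard to read. A log-scale version (my own variation, not in the original post) spreads the distribution out:
# Density of characters per line on a log10 scale
plot(density(log10(x)), main = "log10(characters per line)")
polygon(density(log10(x)), col = "red", border = "blue")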
Next I extracted only the elements that end with the Myanmar section sign “။” (\u104b), to keep just the sentences or paragraphs in the Myanmar language. This left me with 187,212 elements containing 35 million characters. Again you can look at the distribution of the number of characters per element.
textMyNbl.s0 <- textMyNbl[grep("\u104b$", textMyNbl)]   # keep only elements ending with the section sign "။"
y <- nchar(textMyNbl.s0)
length(textMyNbl.s0)
[1] 187212
sum(y)
[1] 35381034
summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1      30      86     189     269   10087 
plot(density(y))
polygon(density(y), col="red", border="blue")
The remaining text will certainly still have some elements that are fragments of sentences rather than complete sentences, and you can’t know their number for certain without actually reading through them! So I arbitrarily took the elements that contain more than 100 characters and hoped that this would leave me with only complete sentences or paragraphs. That is how I arrived at my quick and dirty Myanmar Wikipedia corpus! Try it for yourselves and improve on it.
textMyNbl.s1 <- textMyNbl.s0[which(nchar(textMyNbl.s0) > 100)]   # keep only elements longer than 100 characters
z <- nchar(textMyNbl.s1)
length(textMyNbl.s1)
[1] 87405
sum(z)
[1] 31585702
summary(z)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  101.0   170.0   288.0   361.4   466.0 10087.0 
plot(density(z))
polygon(density(z), col="blue", border="red")
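As a quick sanity check (my own addition), you can compute what fraction of the raw text survives each filtering step:
sum(y) / sum(x)   # about 0.51: share of characters in lines ending with the section sign
sum(z) / sum(x)   # about 0.46: share of characters kept in the final corpus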
This quick and dirty corpus has 87,405 lines of text with a total length of 32 million characters.
This “myWiki-QDcorpus.txt” file is available for download here.
# Trim leading and trailing whitespace and write the corpus out, one line per element
writeLines(trimws(textMyNbl.s1), con = "myWiki-QDcorpus.txt", useBytes = TRUE)
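To use the corpus later, it can be read straight back into R (a sketch; the object name qdCorpus is my own):
qdCorpus <- readLines("myWiki-QDcorpus.txt", encoding = "UTF-8")
length(qdCorpus)   # 87405 lines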
