Saturday, March 30, 2019

Cleaning the raw Myanmar Wikipedia corpus

The raw corpus I reported in my last post contains headings, footnotes, references, headings and text for items that are not part of any article, and other stray text. I tried reading a bit about what makes a corpus, but mostly tl;dr. So I boldly assumed that extracting the complete body text of the articles from the raw corpus would do, at least as a first step.
As I see it, this step is quite simple to carry out. First, open the raw corpus text file in Notepad++ and scroll down until you find an article. For example, in my raw corpus file the body text of the first article, on electronics, begins at line 85 and ends at line 162. Select this body text, paste it into a new blank file, and save it. Then copy the file to a new name, open both files in Notepad++, and join the lines by placing the cursor at the beginning of each following line and pressing the backspace key until you get a whole sentence or paragraph. The following is an example for a short article on a Myanmar tree species beginning at line 511929.
Here is that article after its text was selected from the main raw file, saved to a new text file, copied, and edited:
I think such an operation (plus a little editing where necessary) could be done without trouble by anyone with a bit of experience using a PC. But for the Myanmar-language Wikipedia dump, with its 44,000-plus articles, the whole job would be immense. That is exactly why I thought crowdsourcing would be the ideal approach for tackling it.
Instead of the purely manual cleaning shown in the example above, you could programmatically concatenate all the text of an article into one line, using R, for example. In the screenshot above, the original text in the left pane has been concatenated into 12 lines (paragraphs) in the right pane. Here we put them into just one line:
load("mywiki-20190201-textDataSets.rda")
ingyinText <- paste0(textMyNbl[511929:511982], collapse="")
str(ingyinText)
 chr "<U+1021><U+1004><U+103A><U+1000><U+103C><U+1004><U+103A><U+1038><U+1015><U+1004><U+103A><U+101E><U+100A><U+103A"| __truncated__
Write to text file:
writeLines(ingyinText, con="ingyinText.txt", useBytes=TRUE)
Opening the text file in Notepad++ shows that all the text has been put into one line. Some syllables appear broken up there, but if you open the file in ordinary Notepad you will see that the text is fine.
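If you would rather not rely on how a particular editor renders Myanmar script, you can also check the file from within R. The following is only an illustrative sketch, assuming the ingyinText.txt file and the ingyinText object created above: it reads the file back and confirms that it holds a single line matching the string we wrote out.
checkText <- readLines("ingyinText.txt", encoding = "UTF-8")  # read the file back as UTF-8 text
length(checkText)         # should be 1: everything is on one line
nchar(checkText)          # should match nchar(ingyinText)
checkText == ingyinText   # should be TRUE if nothing was lost or altered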
This operation is simple enough for an article once you can identify its beginning and ending lines. But that is not so easy when you actually go down the lines of my raw corpus file in Notepad++ to do just that. As I have shown below, I found the first two articles, with headings at lines 82 and 166 and endings at lines 165 and 496, quite easily. But the next one, on the continent of Asia, was quite hard. With most articles you will need to delete the paragraph headings, captions for graphics, tables of data, and so on before the remaining body text can be concatenated. I guess this kind of tidying up could only be done manually!
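That said, a crude heuristic might at least shortlist the lines a human needs to look at. The sketch below is only an illustration under my own assumptions, not part of my actual workflow: it guesses that lines which are very short (the 40-character threshold is arbitrary) and do not end with the Myanmar sentence-final mark ။ (U+104B) are probably headings or captions rather than body text, using the first article's span from the discussion above.
candidate <- textMyNbl[82:165]                       # span of the first article, headings included
isHeadingLike <- nchar(candidate) < 40 &             # suspiciously short line, and
                 !grepl("\u104B\\s*$", candidate)    # does not end with the Myanmar full stop
which(isHeadingLike)                                 # offsets a human should check before concatenating
Even then, captions, data tables, and list items would still slip through, so at best this produces a list of lines to review rather than replacing the manual check.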