In my last post I talked about my unsuccessful attempt to create a corpus from all the articles in the Myanmar language Wikipedia. I had used the Wikimedia database dump of the Burmese Wikipedia of February 01, 2019, from here. While some people were calling a couple of hundred megabytes of XML content huge, I was trying to handle 5.03 GB! No wonder I failed. Looking around on the Web, I came to understand that to handle such a huge XML file I would have to use a different set of tools, like those built with the Python programming language. Or, if I stuck with R and its XML package, I would have to use what is known as SAX parsing for XML. That is available through the xmlEventParse( ) function, and I even tried using it. With it I was able to extract text from the huge XML file. But the file I was using, “mywiki-20190201-pages-meta-history.xml”, contains all the revisions of all the articles as well as the associated metadata, and I couldn’t find a way to get only the body text of the latest version of each article.
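For anyone curious what that looks like, here is a minimal sketch of SAX-style parsing with xmlEventParse( ) from the XML package. This is not my exact code; the handler names follow the XML package’s event-handler interface, and it does not solve the revision problem described above, since it collects the contents of every &lt;text&gt; node, that is, every revision:
library(XML)
inText <- FALSE
chunks <- character(0)
handlers <- list(
  startElement = function(name, attrs, ...) {
    if (name == "text") inText <<- TRUE
  },
  endElement = function(name, ...) {
    if (name == "text") inText <<- FALSE
  },
  text = function(content, ...) {
    # appending with c() is slow for a 5 GB file; written this way only for clarity
    if (inText) chunks <<- c(chunks, content)
  }
)
xmlEventParse("mywiki-20190201-pages-meta-history.xml", handlers = handlers)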
The breakthrough came when I was poring over the Wikimedia dump documentation and happened to notice that there should be a … -pages-articles.xml file in bz2 format that contains the Wikipedia articles only. Last time I missed this mywiki-20190201-pages-articles.xml.bz2 file because I didn’t click the “Show all files” button on the download page. You can download it here.
It was quite a relief, then, to find that the size of the unzipped file mywiki-20190201-pages-articles.xml is “only” 295 MB. With such a “small” size, R should be able to read the whole file into memory, and then I might be able to use regex to extract only the Myanmar language text. Here goes!
textU <- readLines(con="mywiki-20190201-pages-articles.xml", encoding="UTF-8")
incomplete final line found on 'mywiki-20190201-pages-articles.xml'
str(textU)
chr [1:3851785] "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.10/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-ins"| __truncated__ ...
Reading the xml file took 9.01 seconds. It resulted in a character vector of length 3,851,785. Viewing the first six lines of text, and the 5305th line:
textU[c(1:6,5305)]
[1] "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.10/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd\" version=\"0.10\" xml:lang=\"my\">"
[2] " <siteinfo>"
[3] " <sitename>ဝီကီပီးဒီးယား</sitename>"
[4] " <dbname>mywiki</dbname>"
[5] " <base>https://my.wikipedia.org/wiki/%E1%80%97%E1%80%9F%E1%80%AD%E1%80%AF%E1%80%85%E1%80%AC%E1%80%99%E1%80%BB%E1%80%80%E1%80%BA%E1%80%94%E1%80%BE%E1%80%AC</base>"
[6] " <generator>MediaWiki 1.33.0-wmf.14</generator>"
[7] "[[အင\u103a္ဂလိပ\u103a]] အခေ\u102b\u103aအဝေ\u102b\u103a Telecommunication ဆိုသော စကားလုံးမ\u103eာ [[ပ\u103cင\u103aသစ\u103a]]စကားလုံး télécommunication မ\u103eရယူထားခ\u103cင\u103aးဖ\u103cစ\u103aသည\u103a။ ထိုစာလုံးမ\u103eာ ဝေးက\u103dာသောနေရာ ဟု အဓိပ္ပ\u102bယ\u103aရသော [[ဂရိဘာသာစကား|ဂရိ]]စာလုံး tele- (τηλε-) န\u103eင့\u103a ဝေမ\u103b\u103eသုံးစ\u103dဲရန\u103a ဟု အဓိပ္ပ\u102bယ\u103aရသော လက\u103aတင\u103aစကားလုံး communicare တို့ကို ပေ\u102bင\u103aးစပ\u103aထားခ\u103cင\u103aးဖ\u103cစ\u103aသည\u103a။ ပ\u103cင\u103aသစ\u103aစာလုံး télécommunication မ\u103eာ ပ\u103cင\u103aသစ\u103aနိုင\u103aငံသား [[အင\u103aဂ\u103bင\u103aနီယာ]]န\u103eင့\u103a ဝတ္တုရေးဆရာဖ\u103cစ\u103aသော Édouard Estaunié မ\u103e ၁၉၀၄ ခုန\u103eစ\u103aတ\u103dင\u103a စတင\u103aသုံးစ\u103dဲခဲ့ခ\u103cင\u103aးဖ\u103cစ\u103aသည\u103a။"
Next we use regular expression matching to replace every character that is not a Myanmar Unicode character with an empty string. This took 46.63 seconds. Again, we look at the same lines as above:
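# replace everything outside U+1000-U+104F (Myanmar letters, vowel signs, digits, and the ၊ ။ punctuation) with ""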
textMy <- gsub("[^\u1000-\u104f]", "", textU)
textMy[c(1:6,5305)]
[1] ""
[2] ""
[3] "ဝီကီပီးဒီးယား"
[4] ""
[5] ""
[6] ""
[7] "အင\u103a္ဂလိပ\u103aအခေ\u102b\u103aအဝေ\u102b\u103aဆိုသောစကားလုံးမ\u103eာပ\u103cင\u103aသစ\u103aစကားလုံးမ\u103eရယူထားခ\u103cင\u103aးဖ\u103cစ\u103aသည\u103a။ထိုစာလုံးမ\u103eာဝေးက\u103dာသောနေရာဟုအဓိပ္ပ\u102bယ\u103aရသောဂရိဘာသာစကားဂရိစာလုံးန\u103eင့\u103aဝေမ\u103b\u103eသုံးစ\u103dဲရန\u103aဟုအဓိပ္ပ\u102bယ\u103aရသောလက\u103aတင\u103aစကားလုံးတို့ကိုပေ\u102bင\u103aးစပ\u103aထားခ\u103cင\u103aးဖ\u103cစ\u103aသည\u103a။ပ\u103cင\u103aသစ\u103aစာလုံးမ\u103eာပ\u103cင\u103aသစ\u103aနိုင\u103aငံသားအင\u103aဂ\u103bင\u103aနီယာန\u103eင့\u103aဝတ္တုရေးဆရာဖ\u103cစ\u103aသောမ\u103e၁၉၀၄ခုန\u103eစ\u103aတ\u103dင\u103aစတင\u103aသုံးစ\u103dဲခဲ့ခ\u103cင\u103aးဖ\u103cစ\u103aသည\u103a။"
Now we strip all the blank lines from the text, reducing it from a total of 3,851,785 lines to 1,294,632. Note that one “line” may contain anything from a single character to multiple sentences (or their equivalent), or fragments of sentences.
textMyNbl <- textMy[!textMy==""]
tail(textMyNbl)
[1] "စတုတ္ထ"
[2] "၂၀၁၈၂၀၁၈ပ\u103cည\u103aနယ\u103aန\u103eင့\u103aတိုင\u103aးအသက\u103a၁၄န\u103eစ\u103aအောက\u103aဘောလုံးပ\u103cိုင\u103aပ\u103dဲအသေးစိတ\u103a"
[3] "ရန\u103aကုန\u103aမ\u103cို့"
[4] "ပူးတ\u103dဲ"
[5] "၁၇"
[6] "မ\u103cန\u103aမာနိုင\u103aငံရ\u103eိဘောလုံးပ\u103cိုင\u103aပ\u103dဲမ\u103bား"
length(textMyNbl)
[1] 1294632
We can now save it to a text file, myPageArticles-rawCorpus.txt. The size of this “raw corpus” is 200.2 MB.
writeLines(textMyNbl, con = "myPageArticles-rawCorpus.txt", sep = "\n", useBytes = TRUE)
This file is too large for ordinary text editors like Notepad to open. You could use Notepad++, which is downloadable here:
Notepad++: a free source code editor which supports several programming languages running under the MS Windows environment.
It took a while to open the file myPageArticles-rawCorpus.txt in Notepad++. The following are sample screenshots of the beginning and end of the file, and two in between.
As you can see, it is not easy to separate the body text of the articles from Wikipedia’s instructions to prospective administrators, text headings, image headings, assorted text, and metadata. While it is tempting to use the Myanmar section sign (Unicode code point U+104B) at the end of a text line to identify complete sentence(s), the third screenshot shows that doing so could omit a lot of text! And in some cases a heading will appear as a sentence, as the highlighted text shows.
However, retaining only the lines of text that hold complete sentence(s) could still give a huge corpus compared to the Asian Language Treebank (ALT) corpus or John Okell’s corpus mentioned in my last post. But then we can easily guess that some sentence fragments would also be included, as is obvious from the above screenshots.
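If you would like to experiment with that line-filtering idea, a rough sketch in R might look like the following (textMySent is just a name I am making up here; as discussed above, this keeps headings that happen to end with ။ and throws away every line that does not):
# keep only lines ending with the Myanmar section mark "။" (U+104B),
# i.e. lines that at least look like they finish with a complete sentence
textMySent <- textMyNbl[grepl("\u104b$", textMyNbl)]
length(textMySent)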
Also, I guess it would be too hard, or almost impossible, to clean such a raw corpus programmatically to get good, clean sentences. But for those of you who would like to give it a try, or would just like to play around with my “raw” corpus myPageArticles-rawCorpus.txt, I’m sharing it with you here. If you want to use the three text datasets, namely “textU”, “textMy”, and “textMyNbl”, in R data format, download the single file containing them (mywiki-20190201-textDataSets.rda) here.
Seriously, I hope some of our own young people (or my fellow dummies, or older folks) will be interested in trying to organize the cleaning and then the annotation of the raw corpus through the power of the masses, that is, crowdsourcing.