After completing my two naive projects on syllable segmentation and word segmentation of Myanmar text, I looked around for a Myanmar corpus big enough to work on. I had heard that the official font of the Myanmar government is the Myanmar-3 Unicode font, so I visited the Information Ministry website on the chance that it was feeding the public libraries with digital media. But I found most of its material to be in non-Unicode formats. I had downloaded some notifications, policy briefs, and other documents from government and state institutions before, but I was looking for more varied topics and styles of writing, so I didn't go for them this time. Instead I came across a number of reports on NLP projects by Myanmar nationals, working singly, in collaboration with other locals, or with researchers from abroad. I read about their mouth-watering constructions of Myanmar corpora. But, as a rule, I couldn't find a word about the accessibility of those corpora, either to the research community or to the public.
It was a consolation, then, when I discovered the Asian Language Treebank (ALT) project.
The ALT project aims to advance the state of the art in Asian natural language processing (NLP) through open collaboration in developing and using ALT. The project is a joint effort of eight institutes, BPPT, I2R, IOIT, NECTEC, NIPTICT, PUP, UCSY, and NICT, to build a parallel treebank for ten languages: English, Filipino, Indonesian, Japanese, Khmer, Laotian, Malay, Myanmar, Thai, and Vietnamese. The building of ALT began with sampling about 20,000 sentences from English Wikinews; these sentences were then translated into the other nine languages. ALT will have word segmentation, part-of-speech (POS) tags, and syntactic analysis annotations, together with word alignment links among these languages.
There is also John Okell's A Corpus of Modern Burmese, which was originally compiled in the 1990s and converted to Unicode more recently. But before exploring these, I got the idea that the Wikipedia in the Myanmar language might be a great source for a corpus. Luckily or unluckily, I found the Wikimedia database dump of the Burmese Wikipedia of February 01, 2019, here. That boosted my fascination with the idea of creating a big corpus by myself, so I started working with some promising R packages: flatxml, xml2, and XML.
# Attempt 1: flatten the whole XML document into a data frame with flatxml
text <- flatxml::fxml_importXMLFlat("mywiki-20190201-pages-meta-history.xml")
## Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : internal error: Huge input lookup [1]

# Attempt 2: parse the document with xml2
text <- xml2::read_xml("mywiki-20190201-pages-meta-history.xml")
## Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : internal error: Huge input lookup [1]

# Attempt 3: read the file into a character vector first, then parse with XML
text <- XML::xmlParse(readLines("mywiki-20190201-pages-meta-history.xml"), asText=TRUE)
## Error in paste(file, collapse = "\n") : result would exceed 2^31-1 bytes
I guess those error messages point to the inability of R to handle the huge XML file (5.03 GB) with the given tools. Looking for answers on Stack Overflow, I was reminded of the "Memory-limits" help page of R, which says that "the number of bytes in a character string is limited to 2^31 - 1", or approximately 2 GB. In my earlier post of December 2014, Big data: small guys could do it?, I had used the ff package and MonetDB to handle a big US Census ASCII data file. But I couldn't see how the current huge XML file could be handled by them.
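The arithmetic makes the failure unsurprising; a quick check in R (a sketch, assuming the dump is in the working directory):

file.size("mywiki-20190201-pages-meta-history.xml")  # about 5.03e9 bytes on disk
2^31 - 1                                             # 2147483647, the cap on a single character string

The third attempt fails exactly here: paste(file, collapse = "\n") tries to build one string out of the whole file, which is more than twice that limit.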
Then I looked around for software outside of R for processing XML data and chose the XMLStarlet command-line XML toolkit. I renamed the long "mywiki-20190201-pages-meta-history.xml" to the shorter "1.xml".
I ran the command "xml el -a" to get the structure, showing the elements as well as the attributes of the 1.xml file, and redirected the result to the file "structure_1xml.txt" (file size: 186 MB). The structure shows nearly 6 million elements and attribute values. As Wikipedia explains, "XPath (XML Path Language) is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document". According to the XMLStarlet user manual, every line of the structure output produced by XMLStarlet is a valid XPath expression. However, you have to modify it before querying, as Jochen Hayek explains: when the document puts its elements in a default namespace, as the MediaWiki dump does, each element name in the XPath has to be prefixed with "_:".
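For the record, the invocation was along these lines (a sketch of the command described above):

xml el -a 1.xml > structure_1xml.txt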
Well, after getting this structure, I tried to get the first article from the 1.xml file with:
xml sel -t -v "//_:page[1]/_:revision/_:text" 1.xml
This resulted in an "Out of memory" error!
Someone said that XMLStarlet "seems to load the whole file into memory, before applying the given Xpath expression". So I was back to square one.
However, for consolation, I tried XMLStarlet on a small file, the Wikimedia incremental dump of the Burmese Wikipedia of February 05, 2019, and found that it works.
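For example, a command along these lines pulls all the revision comments out of the incremental dump into a text file (a sketch: the input file name is hypothetical, and "_:" again stands in for the default namespace):

xml sel -t -m "//_:comment" -v . -n mywiki-20190205-pages-meta-hist-incr.xml > incrRevComment.txt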
# print the extracted revision comments with readable UTF-8 output
utf8::utf8_print(readLines("d:/1_STARLET/incrRevComment.txt", encoding = 'UTF-8'))
[1] "File renamed. ([[Commons:Commons:GlobalReplace|GlobalReplace v0.6.5]])"
[2] "File renamed. ([[Commons:Commons:GlobalReplace|GlobalReplace v0.6.5]])"
[3] "/* ဆက်စပ်လေ့လာရန် */"
[4] "Reverted 1 edit by [[Special:Contributions/Yemyomyint78|Yemyomyint78]] ([[User talk:…"
[5] "/* လူဝတ်လဲခြင်း */"
[6] "[[Special:Contributions/113.249.61.56|113.249.61.56]] ([[User talk:113.249.61.56|ဆွေး…"
[7] "Citation added"
[8] "Citation added"
[9] "Citations"
[10] "clean up"
[11] "1"
[12] "Reverted 1 edit by [[Special:Contributions/45.112.177.45|45.112.177.45]] ([[User tal…"
[13] "/* ကိုးကား */"
[14] "[[User talk:Hue41]] စာမျက်နှာကို [[User talk:KEEPONandON]] သို့ Buggiaက ရွှေ့ခဲ့သည်: အသုံးပြုသူ \"[[S…"
[15] "/* ကိုးကား */"
[16] "[[Special:Contributions/119.76.122.248|119.76.122.248]] ([[User talk:119.76.122.248|…"
[17] "/* အုပ်စု (ခ) */"
[18] "/* အုပ်စု (က) */"
[19] "/* အုပ်စုအဆင့် */"
[20] "/* အုပ်စုအဆင့် */"
[21] "/* ရှုံးထွက်အဆင့် */"
[22] "ကြိုဆိုပါသည်!"
[23] "ကြိုဆိုပါသည်!"
[24] "ကြိုဆိုပါသည်!"
[25] "ကြိုဆိုပါသည်!"
[26] "ကြိုဆိုပါသည်!"
[27] "[[User talk:Hue41]] စာမျက်နှာကို [[User talk:KEEPONandON]] သို့ Buggiaက ရွှေ့ခဲ့သည်: အသုံးပြုသူ \"[[S…"
[28] "စာမျက်နှာကို [[ရှင်ဂုဏဓဇ]] သို့ ပြန်ညွှန်းလိုက်သည်"
[29] "\"ရှင်ဂုဏဓဇ (၁၁၅၃...\" အစချီသော စာလုံးတို့နှင့် စာမျက်နှာကို ဖန်တီးလိုက်သည်"
[30] "\"နာမည်...\" အစချီသော စာလုံးတို့နှင့် စာမျက်နှာကို ဖန်တီးလိုက်သည်"
[31] "ARM Architecture"
[32] "ကြိုဆိုပါသည်!"
[33] "ကြိုဆိုပါသည်!"
[34] "\"ဦးကြည် သည် သစ္စ...\" အစချီသော စာလုံးတို့နှင့် စာမျက်နှာကို ဖန်တီးလိုက်သည်"
[35] "/* ရှင်သာဏေဘဝ */"
[36] "/* စာအုပ်များ ရေးသားပြုစုခြင်း */"
[37] "ကြိုဆိုပါသည်!"
[38] "ကြိုဆိုပါသည်!"
[39] "ကြိုဆိုပါသည်!"
[40] "ကြိုဆိုပါသည်!"
[41] "ကြိုဆိုပါသည်!"
[42] "ကြိုဆိုပါသည်!"
[43] "စာမျက်နှာကို [[ရှင်ကဝိန္ဒာဘိ]] သို့ ပြန်ညွှန်းလိုက်သည်"
[44] "ပြန်ညွှန်းကို [[ရှင်ကဝိန္ဒာဘိ]] မှ [[ရှင်ကဝိန္ဒာဘိ (ပင်းဆရာတော်)]] သို့ ပြောင်းလဲခဲ့သည်"
[45] "\"'''ရှင်ကဝိန္ဒဘိ'''...\" အစချီသော စာလုံးတို့နှင့် စာမျက်နှာကို ဖန်တီးလိုက်သည်"
[46] "ကြိုဆိုပါသည်!"
[47] "ကြိုဆိုပါသည်!"
[48] "ကြိုဆိုပါသည်!"
[49] "ကြိုဆိုပါသည်!"
[50] "ကြိုဆိုပါသည်!"
[51] "\"စာရေးဆရာ သမိုင...\" အစချီသော စာလုံးတို့နှင့် စာမျက်နှာကို ဖန်တီးလိုက်သည်"
[52] "ကြိုဆိုပါသည်!"
So I'll have to find a better way to handle huge XML files. I guess it will involve some method that doesn't need to read the entire file into memory before processing the XML content.
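One candidate is SAX-style event parsing, which the R XML package exposes as xmlEventParse(): it streams the document and fires callbacks node by node, so the whole file never has to sit in memory at once. A minimal sketch, collecting just the page titles from the renamed dump (untested by me on the full 5 GB file):

library(XML)

titles <- character(0)
inTitle <- FALSE

handlers <- list(
  # flag when we enter a <title> element
  startElement = function(name, attrs) {
    if (name == "title") inTitle <<- TRUE
  },
  # collect text nodes, but only those inside <title>
  text = function(content) {
    if (inTitle) titles <<- c(titles, content)
  },
  # unflag when the <title> element closes
  endElement = function(name) {
    if (name == "title") inTitle <<- FALSE
  }
)

# the file is read as a stream; only the callbacks' state stays in memory
invisible(xmlEventParse("1.xml", handlers = handlers))
head(titles)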
An ambitious big Myanmar corpus, as yet unsuccessful.