Monday, January 1, 2024

Problem of using the punctuation character class in regex with Burmese text


Back in 2019 I somehow managed to create a corpus from the dump file of the Myanmar language Wikipedia using the “xml2” package and other tools of the R programming environment. Four years on and feeling a little more confident, I am trying again with the Wikipedia dump file of 2023.

Looking around for resources on the internet, I was lucky to find out here that Lindemann had created a multilingual corpus from Wikipedia files, including one for the Burmese Wikipedia.

Lindemann’s set of “fully cleaned text corpus (110 MB)” was downloaded from here.

After getting that file and extracting the Burmese corpus, I noticed within just the first few lines of text that “၎” was missing from what should have been “၎င်း”.
Then I located the corresponding article in the Burmese Wikipedia and found that “၍” was also missing.

Extract from Lindemann’s corpus


Wikipedia article extract


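My hunch was that Lindemann might be using the punctuation character class “[[:punct:]]” in regex syntax to remove all punctuation. Unfortunately, in my own work I had found that “[[:punct:]]” removes not only the Burmese punctuation marks “၊” and “။” but also characters that are not punctuation, namely, any of “၌၍၎၏”. Unicode assigns all six characters the general category “Po” (other punctuation), which would explain why a Unicode-aware punctuation class matches every one of them.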
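
To make the problem concrete, here is a minimal sketch in Python (not the R code used for this post, and not Lindemann's actual pipeline, which I have not seen). It prints the Unicode general category of the six characters and compares blanket punctuation removal with a version that protects a whitelist. The names `is_punct`, `strip_punct` and `KEEP` are made up for illustration.

```python
import unicodedata

# The six characters discussed above, U+104A to U+104F.
MYANMAR_SIGNS = "\u104a\u104b\u104c\u104d\u104e\u104f"

# Hypothetical whitelist: the four signs that are not really punctuation
# (U+104C to U+104F) and should survive the cleaning step.
KEEP = frozenset("\u104c\u104d\u104e\u104f")


def is_punct(ch: str) -> bool:
    """True if Unicode puts the character in any punctuation category (P*)."""
    return unicodedata.category(ch).startswith("P")


def strip_punct(text: str, keep=KEEP) -> str:
    """Remove punctuation characters, except those listed in keep."""
    return "".join(ch for ch in text if ch in keep or not is_punct(ch))


if __name__ == "__main__":
    # Show how Unicode classifies each of the six characters.
    for ch in MYANMAR_SIGNS:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

    sample = "၎င်း။"
    print(strip_punct(sample, keep=frozenset()))  # blanket removal
    print(strip_punct(sample))                    # whitelist protects the sign
```

With an empty whitelist the function behaves like a blanket punctuation filter and drops “၎” together with “။”. With the whitelist in place only “။” is removed. The same idea should carry over to R by excluding the four characters from the pattern, though I have not tested that here.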

So, I tried searching for all those characters in his Burmese corpus. And I could find none!
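
The check itself is simple to reproduce. Below is a small Python sketch of one way to do it, assuming the corpus has already been read into a string. The function name `count_targets` is made up, and the short strings stand in for the real corpus file.

```python
from collections import Counter

# The six characters to look for, U+104A to U+104F.
TARGETS = "\u104a\u104b\u104c\u104d\u104e\u104f"


def count_targets(text: str, targets: str = TARGETS) -> dict:
    """Count how often each target character occurs in text."""
    counts = Counter(ch for ch in text if ch in targets)
    # Report every target, including those that never occur.
    return {ch: counts[ch] for ch in targets}


if __name__ == "__main__":
    # Stand-in strings; for a real check, read the corpus file
    # with open(path, encoding="utf-8") and pass its contents.
    original = "၎င်း။"
    cleaned = "င်း"
    print(count_targets(original))
    print(count_targets(cleaned))
```

If every count comes back as zero on a large Burmese text, as it did for me, the characters have almost certainly been stripped during cleaning, since they are common in ordinary Burmese writing.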
