I’m gasping in my little pool of NLP, kicking, but comfortably afloat and enjoying myself tremendously.
In the middle of all this, let me take a breather and give you some tips, in case you are interested in checking or replicating what I’ve demonstrated so far in my explorations of the Myanmar Wikipedia data dump.
My little ambition
When I started reading about NLP, and particularly NLP of our Myanmar language by our people, I found only works done by IT professionals and researchers. That’s what was available by Googling, and I may have missed works published by our linguists and other researchers in professional journals. I don’t know if our IT people working in NLP collaborate with a wider section of our people in linguistics and related studies. It is with that latter group, and also with Myanmar language enthusiasts among the general public, young and old, that I want to share what I am doing naively, and mostly for fun. Let them stand on the shoulders of a midget, for the time being, while the giants are not around.
Replicate and disassemble
If you are smart enough to take one look at what I’ve done, shrug, and demonstrate a finer option, all the better. But if you would like to replicate what I’ve done, and then take it apart piece by piece to try to build your own, I would heartily encourage you.
Some tips for R rookies (including myself)
Use RStudio if you are doing NLP for Myanmar language
If you are new to R and see unreadable characters on the screen, you are probably using the default GUI that comes with R. I posted about this in “Myarmar-Sar in R IV: RStudio to the Rescue” of August 8, 2016. There, I quoted Yin Zhu’s post Unicode Tips in Python 2 and R of July 9, 2013 on R-bloggers:
Use a Unicode terminal and a Unicode text editor when working with Python and R. For example, RStudio is, while Rgui.exe isn’t. PyDev plugin/PyScripter is, while the default IDLE isn’t.
Use cat() function or utf8_print() function to see Myanmar Unicode characters
The following is output on the Rgui console:
The following output is on the RStudio console:
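As a minimal illustration of the difference (the string below spells “မြန်မာ”, “Myanmar”), cat() writes the characters themselves, while print() may show escapes depending on the console:

```r
x <- "\u1019\u103c\u1014\u103a\u1019\u102c"  # "Myanmar" in Myanmar script

print(x)      # on Rgui and some terminals this shows "<U+1019>..." escapes
cat(x, "\n")  # writes the raw characters; a Unicode console renders them

# utf8::utf8_print(x) behaves like print() but always emits the UTF-8
# characters themselves (this assumes the utf8 package is installed)
```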
Getting Wordcloud with Myanmar characters right
When I was able to display Myanmar characters on the RStudio console, I thought that a wordcloud of Myanmar syllables would be a piece of cake. But that was not to be, as you can see below. Of the two plots, the top one was what I got when I ran my wordcloud code in the usual way. Let me quote my comments:
As you can see, the problem with this plot is that the Myanmar characters were not shown correctly. The documentation of the quanteda package, as well as searching the Web, was not helpful in resolving this problem. I wasn’t alone in struggling with it: there were similar problems with other non-English languages, such as Chinese, Indian, Korean, Spanish, and Cyrillic ones. The solutions offered were mostly for displaying text, for which I had more or less found a solution for Myanmar Unicode text. However, they remained ineffective for text on graphics.
The bottom one was when I got it right. Thanks to an article on Chinese word clouds, I realized that I would need to somehow let the word cloud function know that I wanted the Myanmar Unicode font displayed on its plot. See my post Word Cloud with Myanmar Syllables for that solution.
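For readers who want the gist without digging through that post: the fix is to tell the graphics device which font family to use before the word cloud is drawn. A minimal sketch, where the font name “Myanmar Text” is an assumption (substitute any installed Myanmar Unicode font), and `syll`/`n` are hypothetical vectors of syllables and frequencies:

```r
pdf(NULL)                      # null device, so this sketch runs headlessly
par(family = "Myanmar Text")   # assumed font name; any Myanmar Unicode font
fam <- par("family")           # the family now set on the device
# wordcloud::wordcloud(words = syll, freq = n) would inherit this family,
# since the wordcloud package draws with base graphics
invisible(dev.off())
```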
There’s more than one way
From synonyms.com:
What I like about R is that you can do the same thing with different code. For example, to see the syllables with the highest frequencies, you can use the topfeatures() function of quanteda. But not all the text is displayed as Myanmar language characters: topfeatures() produces what is known as a “named numbers” vector. One solution I found involves (i) converting the named numbers to a data frame (a solution by Mark Needham), (ii) concatenating the data in each row of the data frame, and (iii) printing with the utf8_print() function. Here’s how it was done in one of my posts:
tf <- topfeatures(dfmsyll.tfw, n = 100)                    # (named numbers)
df.tf <- data.frame(name = names(tf), n = tf, stringsAsFactors = FALSE)
utf8::utf8_print(do.call("paste", c(sep = " - ", df.tf)))  # "syllable - count"
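The same three steps can be seen on a toy example in base R alone (the syllables and counts below are made up):

```r
tf <- c(1500, 1200)                                   # toy frequencies
names(tf) <- c("\u1000\u102c", "\u101e\u100a\u103a")  # toy Myanmar syllables

# (i) named numbers -> data frame
df.tf <- data.frame(name = names(tf), n = tf, stringsAsFactors = FALSE)

# (ii) concatenate each row as "syllable - count"
out <- do.call("paste", c(sep = " - ", df.tf))

# (iii) print the characters themselves rather than escapes
cat(out, sep = "\n")
```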
At that time, I didn’t know that the quanteda package has a textstat_frequency() function that can do the same thing! So the following, from a later post, gives the same kind of output:
senEnd_tf <- textstat_frequency(senEnd)[, 1:2]
utf8::utf8_print(do.call("paste", c(sep = " - ", senEnd_tf)))
Perhaps I forgot how I did it earlier.
For NLP, or in the broader sense of handling strings, I found the regular expression functions from R’s base package indispensable. The stringr package can do much of the same, but sometimes in a more convenient way. Maybe it is like reinventing the square wheel, first into a polygonal form and then into a perfectly circular one. That’s the beauty of R, I guess.
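As one small illustration of that overlap: extracting runs of characters in the Myanmar Unicode block can be done with base R’s gregexpr() and regmatches(), while stringr::str_extract_all() would give the same result in a single call (assuming the stringr package is installed):

```r
x <- "text [[\u1000\u102c]] more \u101e\u100a\u103a here"

# base R: find all runs of characters in the Myanmar Unicode block
m <- regmatches(x, gregexpr("[\u1000-\u104f]+", x, perl = TRUE))[[1]]

# stringr::str_extract_all(x, "[\u1000-\u104f]+")[[1]] is the equivalent
```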
Ceci n’est pas une pipe
Inspired by R’s magrittr package, I posted about an old Myanmar poem, “Stumpy pipe”, instead of writing about %>% itself. I couldn’t know how useful the pipe would be until I found myself using something like this long series of character substitutions:
# remove the hyperlink markers and other unnecessary characters
x100_itN_sen.2 <- gsub("\\[\\[[^\u1000-\u104f]+\\]\\]","",x100_itN_sen.1) %>%
gsub("\\[","",.) %>%
gsub("\\]","",.) %>%
gsub("['|]","",.) %>%
gsub("\\(\\{\\{.+\\}\\}\\)","",.) %>%
gsub("\\{\\{[A-Za-z]+ .*[A-Za-z]+\\}\\}\n+", "", .) %>%
gsub("<.*>.+", "", .) %>%
gsub("\n","", .) %>%
gsub("[A-Za-z]+\\{+.+\\}+","", .) %>%
gsub("\\{+.+\\}+","", .) %>%
gsub("^[\\#]","", .) %>%
.[which(nchar(.)>50)]
Note that this pipe operator is also available when the quanteda package is loaded.
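The idea of the pipe, in a form runnable with base R alone: R (>= 4.1) has the |> operator shown below, while magrittr’s %>% additionally allows “.” as a placeholder, which the chain above relies on.

```r
x <- "[[\u1019\u103c\u1014\u103a\u1019\u102c]]"

# each gsub() receives the previous result as its x argument
y <- x |>
  gsub(pattern = "\\[\\[", replacement = "") |>
  gsub(pattern = "\\]\\]", replacement = "")

cat(y, "\n")
```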
Sources of inspiration
Moral: Yan can cook
Shakespeare: A rose by any other name …
DIY: bigglm on your big data set in open source R, it just works – similar as in SAS
Motivation powered by:
Asian Language Treebank (ALT) Project
A Corpus of Modern Burmese
Various articles on the creation of a corpus from English Wikipedia.