Saturday, February 2, 2019

Word Cloud with Myanmar Syllables

The Myanmar syllables obtained from swanhtet1992’s example text, as reported in my last post, were also saved to an R data file, "SYLL.RData", with: save(SYLL, file = "SYLL.RData"). This post shares my experience of making a word cloud from them using the quanteda package.
rm(list = ls())
load("SYLL.RData")
library(quanteda)
The retrieved data, SYLL, is a nested list. To create a word cloud from it, the data first has to be turned into tokens (here, words), which are then converted to a document-feature matrix ("dfm").
SYLL.c <- corpus(t(data.frame(lapply(SYLL, paste0, collapse = " "))))
SYLL.tw <- tokens(SYLL.c, what = "word")
dfmSYLL.tw <- dfm(SYLL.tw)
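To make the reshaping step more concrete, here is a minimal sketch on a made-up two-element list. The name toy and its contents are my own illustration, not the actual SYLL data, and I am assuming SYLL holds one character vector of syllables per text. It uses vapply() as a simpler alternative to the lapply()/data.frame()/t() chain above, not as a replacement for it.

```r
# Toy stand-in for SYLL: one character vector of syllables per text.
# (Hypothetical data, for illustration only.)
toy <- list(c("ကျေး", "ဇူး", "တင်"), c("ပါ", "သည်"))

# Join the syllables of each text with single spaces,
# giving a plain character vector with one string per text.
toy_joined <- vapply(toy, paste, character(1), collapse = " ")

length(toy_joined)  # number of texts
toy_joined[2]       # second text as one space-separated string
```

A character vector like toy_joined can be passed to corpus() directly, since corpus() accepts character input.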
The dfm is then passed to the textplot_wordcloud() function to produce a word cloud. A seed is set so that anyone can reproduce the same word cloud.
set.seed(12)
textplot_wordcloud(dfmSYLL.tw, min_size = 0.5, max_size = 12, min_count = 1)
As you can see, the problem with this plot is that the Myanmar characters are not shown correctly. Neither the documentation of the quanteda package nor searching the Web helped me resolve this problem. I wasn’t alone in struggling with it. There were various reported problems with non-English languages and scripts: Chinese, Indian, Korean, Spanish, Cyrillic, and so on. The solutions offered were mostly for displaying text, which I had already more or less solved for Myanmar text with Unicode. However, they remained ineffective for text on graphics. Luckily, after looking at the construction of one Chinese word cloud, I realized that I would need to somehow let the word cloud function know that I want the Myanmar Unicode font displayed on its plot. When I naively tried to run:
# textplot_wordcloud(dfmSYLL.tw, font = "Myanmar3", min_size = 0.5, max_size = 12, min_count = 1)
I got this error message, which also pointed out the right approach:
Myanmar3 is not found on your system. Run extrafont::font_import() and extrafont::loadfonts(device = "win") to use custom fonts.
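The one-time setup that the message points to could look like the sketch below. It assumes a Windows machine on which the Myanmar3 font is already installed, and that the font's family name is registered exactly as "Myanmar3". I have not verified this on other systems.

```r
library(extrafont)

# One-time step: build extrafont's database of installed system fonts.
# It asks for confirmation and can take several minutes.
font_import()

# Register the imported fonts with the Windows graphics device.
loadfonts(device = "win")

# Check that the family name is now known to extrafont
# (assumes the font is listed under the name "Myanmar3").
"Myanmar3" %in% fonts()
```

The font database is stored on disk, so font_import() should not need to be repeated in later sessions.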
library(extrafont)
set.seed(12)
textplot_wordcloud(dfmSYLL.tw, font = "Myanmar3", min_size = 0.5, max_size = 12, min_count = 1)
Then I saw that this plot contains an error: a character sequence that is not a syllable, namely န်း. Checking the tokenization that was done with tokens(SYLL.c, what = "word"), I found the error.
 utf8::utf8_print(as.character(SYLL.tw[5]))
 [1] "ပိ"    "ဿာ"   "ချိန်"  "၁၀"   "သား"  "ရှိ"    "သော"  "ကြက်"  "သား"  "များ" "ချက်"  "ပြုတ်" 
[13] "ကျွေး" "မွေး"  "လှူ"    "ဒါ"   "န်း"   "သွား"  "သည့်"   "အ"    "တွက်"   "ကျေး" "ဇူး"   "တင်"  
[25] "ပါ"   "သည်"   "။"   
Using the "fasterword" method instead of "word" in tokens() solved this problem. Color is also added with the brewer.pal() function from RColorBrewer. You can view the available color schemes with display.brewer.all().
SYLL.tfw <- tokens(SYLL.c, what = "fasterword")
dfmSYLL.tfw <- dfm(SYLL.tfw)
set.seed(12)
textplot_wordcloud(dfmSYLL.tfw, font = "Myanmar3", min_size = 0.5, max_size = 12, min_count = 1, color = RColorBrewer::brewer.pal(8, "Dark2"))
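As far as I understand the quanteda documentation, "word" applies language-aware word-boundary rules, while "fasterword" simply splits on whitespace, which would explain why the syllables survive intact. The base R sketch below only imitates that whitespace splitting with strsplit(). It is not quanteda's own code, and it assumes the intended syllable was ဒါန်း, that is, the two fragments ဒါ and န်း from the printed tokens joined together.

```r
# Three space-separated syllables, as they would appear in the corpus text.
# (Assumption: the original syllable was "ဒါန်း" before "word" split it.)
txt <- "လှူ ဒါန်း သွား"

# Split on the space character only; nothing inside a syllable is touched.
toks <- strsplit(txt, " ", fixed = TRUE)[[1]]

length(toks)          # number of syllables recovered
"ဒါန်း" %in% toks      # is the syllable kept whole?
"န်း" %in% toks        # does the stray fragment appear as its own token?
```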
