Tuesday, February 19, 2019

Ayarwady and Sad WORDS


These days others are thinking hard about Ayarwady. From time to time we have to think about Ayarwady too. In reality, we are the only ones who will make or break Ayarwady. That being clear, I hope that I am not going to hear sad news about Ayarwady any time soon, or in what is left of my lifetime.
In the context of a few posts back in this blog, I was fascinated by the idea of sentiment analysis of comments like those on Facebook, if possible in our own Myanmar language. That led me into playing with syllable segmentation and word segmentation on Myanmar text. While I was very much enjoying that, some articles on emotion detection in text caught my eye. I was excited. I had never heard of it before. Could dummies do that? Let me give it a try.
In fact, these words from Emotion Detection in Text: a Review by Seyeditabari and others motivated me to leapfrog sentiment analysis in favor of emotion detection in my typical dummy-way:
“Emotion detection in computational linguistics is the process of identifying discrete emotion expressed in text. Emotion analysis can be viewed as a natural evolution of sentiment analysis and its more fine-grained model. … Sentiment analysis, with thousands of articles written about its methods and applications, is a well established field in natural language processing. It has proven very useful in several applications such as marketing, advertising … question answering systems … summarization … as part of recommendation systems … or even improving information extraction … . On the other hand, the amount of useful information which can be gained by moving past the negative and positive sentiments and towards identifying discrete emotions can help improve many applications mentioned above, and also open ways to new use cases”.
In something of a craze, I began by downloading the Wikipedia page on our untouchable Ayarwady (Irrawaddy) river. Actually, I was sidetracked into this by my frustration at failing to extract text from the entire set of Myanmar Wikipedia pages. I was unhappy about the scarcity of Myanmar language text collections, or Myanmar language corpora, as the NLP community calls them. So I was going to create a big Myanmar corpus out of the Myanmar Wikipedia pages, since that is a good, sizable collection of Myanmar language text in Unicode, I thought. Then this task proved to be too big for the tools I had at hand. But that certainly calls for another story.
Anyway, to make the story short, I chose to try out the Syuzhet R package for detecting emotions in the Wikipedia article on the Ayarwady river. It was quite easy. It took no brain-work of mine. I just followed the examples given in the vignette (Introduction to the Syuzhet Package) that comes with the installation of the package.


Getting the data ready for processing

To analyze the data in sentence form in Syuzhet, you have to begin by using the get_sentences() function on your text. That tokenizes your text into sentences, which means that a character vector with one sentence per element is created.
For the following exercise, you need to have the R packages syuzhet and htm2txt installed on your computer.
library(syuzhet)
# Loads a file as a single text string.
ayarT <- get_text_as_string("https://en.wikipedia.org/wiki/Irrawaddy_River")
# Get tokenized sentences from ayarT.
ayar_sen <- get_sentences(ayarT)
head(ayar_sen)
The above code block imports the Wikipedia page as a single text string and tokenizes it into sentences. This text string contains HTML tags, which need to be removed. So instead, I used the htm2txt package to import plain text from the webpage and then saved it as a UTF-8 text file.
x <- htm2txt::gettxt("https://en.wikipedia.org/wiki/Irrawaddy_River")
str(x)
 chr "Irrawaddy River\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJump to search\n\n\"Ayeyarwady\"| __truncated__
I saved it to a text file.
zz <- file("ayar0.txt", "w")
writeLines(x, con=zz, useBytes = TRUE)
close(zz)
The plain text file contained 582 lines of text, including blank lines. I deleted parts of the text by hand to leave only the body of the article. For the reproducibility of the results posted here, I’ve made this file available here.


Detecting emotions from the text

library(syuzhet)
package ‘syuzhet’ was built under R version 3.5.2
# Loads the file as a single text string.
y <- get_text_as_string("ayar.txt")
# tokenize sentences
y_sen <- get_sentences(y)
str(y_sen)
 chr [1:121] "Irrawaddy River  The Irrawaddy or, officially, Ayeyarwady[4] River (Burmese: ဧရာဝ"| __truncated__ ...
head(y_sen)
[1] "Irrawaddy River  The Irrawaddy or, officially, Ayeyarwady[4] River (Burmese: ဧရာဝတီမြစ်; MLCTS: erawa."
[2] "ti mrac, pronounced [ʔèjàwədì mjɪʔ], also spelt Ayeyarwaddy[citation needed]) is a river that flows from north to south through Myanmar."
[3] "It is the country's largest river and most important commercial waterway."                                                                                                  
[4] "Originating from the confluence of the N'mai and Mali rivers,[5] it flows relatively straight North-South before emptying through the Irrawaddy Delta into the Andaman Sea."
[5] "Its drainage basin of about 404,200 square kilometres (156,100 sq mi) covers a large part of Burma."                                                                        
[6] "After Rudyard Kipling's poem, it is sometimes referred to as 'The Road to Mandalay'."                                                                                       
To identify emotions in the sentences, the NRC Emotion Lexicon by Saif Mohammad is used. According to him, “the NRC emotion lexicon is a list of words and their associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive)”. In the data frame returned by the function call below, each row represents a sentence from the original text, and each column gives the number of detections for one emotion or sentiment in that sentence.
nrc_data <- get_nrc_sentiment(y_sen)
package ‘bindrcpp’ was built under R version 3.5.2
head(nrc_data)
  anger anticipation disgust fear joy sadness surprise trust negative
1     0            0       0    0   0       0        0     0        0
2     0            0       0    0   0       0        0     0        0
3     0            0       0    0   0       0        0     1        0
4     0            0       0    0   0       0        0     0        0
5     0            0       0    0   0       0        0     0        1
6     0            0       0    0   0       0        0     0        0
tail(nrc_data)
    anger anticipation disgust fear joy sadness surprise trust negative
116     0            0       0    0   0       0        0     0        0
117     2            0       0    1   1       2        1     0        4
118     1            0       0    1   0       1        1     0        2
119     1            0       0    0   0       0        0     0        1
120     1            1       0    1   0       0        0     0        2
121     1            1       0    3   1       1        1     1        3
summary(nrc_data)
     anger         anticipation       disgust             fear             joy        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
 Mean   :0.1901   Mean   :0.2562   Mean   :0.08264   Mean   :0.2562   Mean   :0.2066  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :3.0000   Max.   :3.0000   Max.   :1.00000   Max.   :3.0000   Max.   :2.0000  
    sadness          surprise          trust           negative         positive    
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000   Median :1.000  
 Mean   :0.1983   Mean   :0.1322   Mean   :0.4463   Mean   :0.6198   Mean   :1.058  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.000  
 Max.   :2.0000   Max.   :2.0000   Max.   :4.0000   Max.   :4.0000   Max.   :6.000  
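To get a feel for what the counts in nrc_data mean, here is a minimal base-R sketch of lexicon-based counting. It is only my own illustration of the general idea, not Syuzhet's actual implementation, and toy_lexicon, toy_sen and count_emotions() are made-up names with made-up data (these are not entries taken from the real NRC lexicon).

```r
# Toy lexicon: made-up word/emotion pairs for illustration only,
# NOT entries copied from the real NRC Emotion Lexicon.
toy_lexicon <- data.frame(
  word    = c("flood", "flood", "festival", "danger"),
  emotion = c("fear", "sadness", "joy", "fear"),
  stringsAsFactors = FALSE
)

# Made-up sentences standing in for the vector of sentences.
toy_sen <- c(
  "The flood brought danger to the delta.",
  "The festival was held by the river.",
  "It flows from north to south."
)

# For every sentence, count the words associated with each emotion.
count_emotions <- function(sentences, lexicon) {
  emotions <- sort(unique(lexicon$emotion))
  rows <- lapply(sentences, function(s) {
    # Lower-case the sentence and split it on anything that is not a letter.
    words <- strsplit(tolower(s), "[^a-z]+")[[1]]
    # One count per emotion: how many words of the sentence carry it.
    vapply(emotions, function(e) {
      sum(words %in% lexicon$word[lexicon$emotion == e])
    }, numeric(1))
  })
  # One row per sentence, one column per emotion.
  as.data.frame(do.call(rbind, rows))
}

toy_scores <- count_emotions(toy_sen, toy_lexicon)
print(toy_scores)
```

A word can be associated with more than one emotion ("flood" counts for both fear and sadness in the toy lexicon), which is why a single sentence can score on several columns at once.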
Now for a plot of emotions.
# Print the eight emotion columns as a table (requires the pander package).
pander::pandoc.table(nrc_data[, 1:8], split.table = Inf)
barplot(
  sort(colSums(prop.table(nrc_data[, 1:8]))), 
  horiz = TRUE, 
  cex.names = 0.7, 
  las = 1, 
  main = "Emotions in Sample text", xlab="Percentage"
  )
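As a rough check on what the bar plot shows, the column totals can be recovered from the means printed by summary(nrc_data), since each mean is the column total divided by the 121 sentences. The sketch below assumes the rounded means are accurate enough that rounding the products gives back the original totals. The variable names are mine.

```r
# Means of the eight emotion columns, copied from summary(nrc_data).
emotion_means <- c(
  anger = 0.1901, anticipation = 0.2562, disgust = 0.08264, fear = 0.2562,
  joy = 0.2066, sadness = 0.1983, surprise = 0.1322, trust = 0.4463
)
n_sentences <- 121  # length of the sentence vector reported by str(y_sen)

# Mean times number of sentences gives the column total (up to rounding).
emotion_totals <- round(emotion_means * n_sentences)

# Share of each emotion among all emotion detections, smallest first,
# which is the same quantity the barplot() call draws.
emotion_shares <- sort(emotion_totals / sum(emotion_totals))
print(round(100 * emotion_shares, 1))
```

Note that these shares are proportions that sum to 1, so the x-axis of the plot runs from 0 to 1 unless the values are multiplied by 100.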
You can pick out the sentences in which a given emotion, for example “sadness”, was detected at least once.
nrc_data[nrc_data$sadness>0,]
   anger anticipation disgust fear joy sadness surprise trust negative
11     1            1       1    2   0       1        0     1        1
17     0            0       0    0   2       1        1     2        1
36     0            1       0    0   0       1        0     1        1
43     0            0       0    0   0       1        0     0        1
46     0            1       0    1   1       1        1     0        2
51     1            0       1    1   2       1        1     2        3
57     0            0       0    0   1       2        0     1        2
58     0            0       0    0   1       1        0     2        1
59     0            0       0    0   1       1        0     1        1
61     0            0       0    0   0       1        0     0        2
This has been my first ever exercise on emotion detection from text and it gives me just a feel for this topic. Now the first thing you could do may be to inspect the text for the i-th sentence, say, from the vector of sentences “y_sen” for an emotion of your interest. You could do that as
y_sen[i]
to have some idea of how emotion detection works.
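Rather than looking up one sentence at a time, you could list every flagged sentence next to its score. Here is a minimal sketch of that. The helper show_sentences() is my own hypothetical function, and toy_sen and toy_nrc are made-up stand-ins for y_sen and nrc_data so that the example runs on its own.

```r
# Made-up stand-ins for y_sen and nrc_data, for illustration only.
toy_sen <- c(
  "The river is calm.",
  "The flood caused great loss.",
  "Boats carry rice downstream.",
  "Many mourn the drowned."
)
toy_nrc <- data.frame(sadness = c(0, 2, 0, 1), joy = c(1, 0, 0, 0))

# Return index, score and text of every sentence where `emotion` was detected.
show_sentences <- function(sentences, scores, emotion) {
  i <- which(scores[[emotion]] > 0)
  data.frame(
    i        = i,
    score    = scores[[emotion]][i],
    sentence = sentences[i],
    stringsAsFactors = FALSE
  )
}

sad <- show_sentences(toy_sen, toy_nrc, "sadness")
print(sad)
```

With the real objects from this post, the call would be show_sentences(y_sen, nrc_data, "sadness").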
