These days others have been thinking hard about Ayarwady. From time to time we have to think about Ayarwady too. In reality, we are the only ones who will make or break Ayarwady. That being clear, I hope that I am not going to hear sad news about Ayarwady any time soon, or in what is left of my lifetime.
A few posts back in this blog, I was fascinated by the idea of sentiment analysis of comments like those on Facebook, if possible in our own Myanmar language. That led me into playing with syllable segmentation and word segmentation of Myanmar text. While I was very much enjoying that, some articles on emotion detection in text caught my eye. I was excited. I had never heard of it before. Could dummies do that? Let me give it a try.
In fact, these words from Emotion Detection in Text: a Review by Seyeditabari and others motivated me to leapfrog sentiment analysis in favor of emotion detection, in my typical dummy way:
“Emotion detection in computational linguistics is the process of identifying discrete emotion expressed in text. Emotion analysis can be viewed as a natural evolution of sentiment analysis and its more fine-grained model. … Sentiment analysis, with thousands of articles written about its methods and applications, is a well established field in natural language processing. It has proven very useful in several applications such as marketing, advertising … question answering systems … summarization … as part of recommendation systems … or even improving information extraction … . On the other hand, the amount of useful information which can be gained by moving past the negative and positive sentiments and towards identifying discrete emotions can help improve many applications mentioned above, and also open ways to new use cases”.
With a craze, more or less, I began by downloading the Wikipedia page on our untouchable Ayarwady (Irrawaddy) river. Actually, I was sidetracked into this by my frustration at failing to extract text from the entire set of Myanmar Wikipedia pages. I was unhappy about the scarcity of Myanmar language text collections, or Myanmar language corpora, as the NLP community calls them. So I was going to create a big Myanmar corpus out of the Myanmar Wikipedia pages, since that is a good, sizable collection of Myanmar language text in Unicode, I thought. Then this task proved too big for the tools I had at hand. But that certainly calls for another story.
Anyway, to make the story short, I chose to try out the Syuzhet R package for detecting emotions in the Wikipedia article on the Ayarwady river. It was quite easy and took no brain-work of mine. I just followed the examples given in the vignette (Introduction to the Syuzhet Package) that comes with the installation of the package.
Getting the data ready for processing
To analyze the data in sentence form in Syuzhet, you begin by calling the get_sentences() function on your text. That tokenizes your text into sentences, which means that a character vector with one sentence per element is created.
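To make "a vector of sentences" concrete, here is a rough base-R sketch. The function naive_sentences() and the toy text are my own inventions for illustration; this is a simple regular-expression split, not the algorithm that get_sentences() uses.

```r
# Naive sentence splitter: break after ., ! or ? when whitespace follows.
# Illustration only; this is NOT how syuzhet's get_sentences() works inside.
naive_sentences <- function(txt) {
  parts <- unlist(strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE))
  parts[nzchar(parts)]  # drop any empty pieces
}

toy_text <- "The river is long. It flows south! Does it reach the sea?"
toy_sen <- naive_sentences(toy_text)

str(toy_sen)     # a character vector, one element per sentence
length(toy_sen)  # number of sentences found
```

A splitter this simple will stumble on abbreviations such as "e.g.", which is one reason to leave the job to a package.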
For the following exercise, you need to have the R packages syuzhet and htm2txt installed on your computer.
library(syuzhet)
# Loads a file as a single text string.
ayarT <- get_text_as_string("https://en.wikipedia.org/wiki/Irrawaddy_River")
# Get tokenized sentences from ayarT.
ayar_sen <- get_sentences(ayarT)
head(ayar_sen)
The above code block imports the Wikipedia page as a single text string and tokenizes it into sentences. This text string contains HTML tags which need to be removed. So instead, I used the htm2txt package to import plain text from the web page and then saved it as a UTF-8 text file.
x <- htm2txt::gettxt("https://en.wikipedia.org/wiki/Irrawaddy_River")
str(x)
chr "Irrawaddy River\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJump to search\n\n\"Ayeyarwady\"| __truncated__
I saved it to a text file.
zz <- file("ayar0.txt", "w")
writeLines(x, con=zz, useBytes = TRUE)
close(zz)
The plain text file contained 582 lines of text including blank lines. I deleted parts of the text by hand to leave only the body of the text; that trimmed file is the ayar.txt loaded below. For the reproducibility of the results posted here, I’ve made this file available here.
Detecting emotions from the text
library(syuzhet)
package ‘syuzhet’ was built under R version 3.5.2
# Loads the file as a single text string.
y <- get_text_as_string("ayar.txt")
# tokenize sentences
y_sen <- get_sentences(y)
str(y_sen)
chr [1:121] "Irrawaddy River The Irrawaddy or, officially, Ayeyarwady[4] River (Burmese: ဧရာဝ"| __truncated__ ...
head(y_sen)
[1] "Irrawaddy River The Irrawaddy or, officially, Ayeyarwady[4] River (Burmese: ဧရာဝတီမြစ်; MLCTS: erawa."
[2] "ti mrac, pronounced [ʔèjàwədì mjɪʔ], also spelt Ayeyarwaddy[citation needed]) is a river that flows from north to south through Myanmar."
[3] "It is the country's largest river and most important commercial waterway."
[4] "Originating from the confluence of the N'mai and Mali rivers,[5] it flows relatively straight North-South before emptying through the Irrawaddy Delta into the Andaman Sea."
[5] "Its drainage basin of about 404,200 square kilometres (156,100 sq mi) covers a large part of Burma."
[6] "After Rudyard Kipling's poem, it is sometimes referred to as 'The Road to Mandalay'."
To identify emotions in the sentences, the NRC Emotion Lexicon by Saif Mohammad is used. According to him, “the NRC emotion lexicon is a list of words and their associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive)”. In the data frame returned by the function call below, each row represents a sentence from the original text and shows the number of detections for each emotion or sentiment.
nrc_data <- get_nrc_sentiment(y_sen)
package ‘bindrcpp’ was built under R version 3.5.2
head(nrc_data)
 | anger | anticipation | disgust | fear | joy | sadness | surprise | trust | negative
---|---|---|---|---|---|---|---|---|---
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
tail(nrc_data)
 | anger | anticipation | disgust | fear | joy | sadness | surprise | trust | negative
---|---|---|---|---|---|---|---|---|---
116 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
117 | 2 | 0 | 0 | 1 | 1 | 2 | 1 | 0 | 4
118 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 2
119 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1
120 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 2
121 | 1 | 1 | 0 | 3 | 1 | 1 | 1 | 1 | 3
summary(nrc_data)
anger anticipation disgust fear joy
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
Mean :0.1901 Mean :0.2562 Mean :0.08264 Mean :0.2562 Mean :0.2066
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
Max. :3.0000 Max. :3.0000 Max. :1.00000 Max. :3.0000 Max. :2.0000
sadness surprise trust negative positive
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :1.000
Mean :0.1983 Mean :0.1322 Mean :0.4463 Mean :0.6198 Mean :1.058
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.000
Max. :2.0000 Max. :2.0000 Max. :4.0000 Max. :4.0000 Max. :6.000
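To get a feel for where such counts come from, here is a dummy-level sketch of lexicon-based counting in base R. The tiny lexicon, the sentences and the helper count_emotions() are all made up by me for illustration; they are not NRC entries and this is not the actual code of get_nrc_sentiment(). The idea is simply to look up every word of a sentence in a word-emotion list and tally the hits per emotion.

```r
# Toy lexicon: hypothetical word-emotion pairs, NOT taken from the NRC lexicon.
toy_lexicon <- data.frame(
  word    = c("flood", "flood", "danger", "festival", "festival", "loss"),
  emotion = c("fear", "sadness", "fear", "joy", "anticipation", "sadness"),
  stringsAsFactors = FALSE
)
emotions <- c("fear", "sadness", "joy", "anticipation")

# Count, for one sentence, how many lexicon hits fall under each emotion.
count_emotions <- function(sentence) {
  # lower-case the sentence, then split on anything that is not a letter
  words <- unlist(strsplit(tolower(sentence), "[^a-z]+"))
  # collect the emotions attached to every word that is in the lexicon
  hits <- unlist(lapply(words, function(w) {
    toy_lexicon$emotion[toy_lexicon$word == w]
  }))
  tab <- table(factor(hits, levels = emotions))
  setNames(as.integer(tab), emotions)
}

toy_sen <- c(
  "The flood brought danger and loss.",
  "The festival begins tomorrow.",
  "The river flows south."
)

# One row per sentence, one column per emotion, like nrc_data in miniature.
toy_counts <- as.data.frame(t(sapply(toy_sen, count_emotions, USE.NAMES = FALSE)))
print(toy_counts)
```

A word can be listed under more than one emotion, so a single word may add to several columns at once.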
Now for a plot of emotions.
pander::pandoc.table(nrc_data[, 1:8], split.table = Inf)
barplot(
  sort(colSums(prop.table(nrc_data[, 1:8]))),
  horiz = TRUE,
  cex.names = 0.7,
  las = 1,
  main = "Emotions in Sample text", xlab = "Percentage"
)
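The numbers behind the bars can be worked out by hand. The sketch below uses a small made-up data frame (toy_counts, with invented counts and only four emotions) in place of nrc_data[, 1:8]. It shows that the plotted values are shares of all emotion detections that add up to 1, so they need to be multiplied by 100 to read as percentages.

```r
# Made-up counts standing in for nrc_data[, 1:8]; the numbers are invented.
toy_counts <- data.frame(
  anger = c(0, 1, 0),
  fear  = c(1, 0, 2),
  joy   = c(0, 2, 0),
  trust = c(1, 2, 1)
)

totals <- colSums(toy_counts)   # detections per emotion over all sentences
shares <- totals / sum(totals)  # each emotion's share of all detections

# The same values come out of the colSums(prop.table(...)) idiom.
stopifnot(isTRUE(all.equal(
  shares,
  colSums(prop.table(as.matrix(toy_counts)))
)))

percent <- round(100 * sort(shares), 1)  # sorted ascending, in percent
print(percent)
```

If you want the axis label "Percentage" to be literally true, one option is to pass 100 times the sorted shares to barplot().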
You could identify the sentences that have at least one identified emotion, for example, “sadness”.
nrc_data[nrc_data$sadness>0,]
 | anger | anticipation | disgust | fear | joy | sadness | surprise | trust | negative
---|---|---|---|---|---|---|---|---|---
11 | 1 | 1 | 1 | 2 | 0 | 1 | 0 | 1 | 1
17 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 2 | 1
36 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1
43 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1
46 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 2
51 | 1 | 0 | 1 | 1 | 2 | 1 | 1 | 2 | 3
57 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 2
58 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | 1
59 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1
61 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2
This has been my first ever exercise on emotion detection from text, and it gives me just a feel for the topic. Now the first thing you could do may be to inspect the text of the i-th sentence, say, from the vector of sentences “y_sen”, for an emotion of your interest. You could do that as
y_sen[i]
to have some idea of how emotion detection works.
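Rather than looking up one i at a time, you could pull out all the flagged sentences together with their counts. The sketch below uses made-up stand-ins (toy_sen for y_sen and toy_nrc for nrc_data, with invented sentences and counts) so that it runs on its own.

```r
# Made-up stand-ins for y_sen and nrc_data; sentences and counts are invented.
toy_sen <- c(
  "The flood brought loss.",
  "The river flows south.",
  "Grief followed the storm."
)
toy_nrc <- data.frame(sadness = c(1, 0, 2), joy = c(0, 0, 0))

# Row numbers of the sentences with at least one "sadness" detection.
idx <- which(toy_nrc$sadness > 0)

# Put sentence number, count and text side by side for reading.
flagged <- data.frame(
  sentence_no = idx,
  sadness     = toy_nrc$sadness[idx],
  text        = toy_sen[idx],
  stringsAsFactors = FALSE
)
print(flagged)
```

With the real objects from this post, the same indexing would be y_sen[nrc_data$sadness > 0].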