The stackr package is an R package that you can use to get data from the Stack Exchange family of websites. Well, the idea for a big part of my Bayanathi blog has been, and will be, to stick to the “Yan Can Cook” philosophy. The only difference from Yan is that his demonstrations are based on his famous cooking and years of demos, while mine are based on seeing some nice recipes followed by my first attempts to cook them! Currently I am trying out some “Suu” recipes so that my fellow dummies may “Tuu” and (granted that my dishes were too salty, oily, or smelly and somewhat burnt) improve on them to finally get to “Pyuu”.
Stack Exchange is a network of question-and-answer websites on topics in varied fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The sites are modeled after Stack Overflow, a Q&A site for computer programming questions that was the original site in this network. The reputation system allows the sites to be self-moderating. (Stack Exchange, Wikipedia)
Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, … It features questions and answers on a wide range of topics in computer programming. … As of April 2014, Stack Overflow has over 4,000,000 registered users and more than 10,000,000 questions, with 10,000,000 questions celebrated in late August 2015. (Stack Overflow, Wikipedia)
Since, by default, a stackr search takes your query to the Stack Overflow website, it would be good to read about it now, or later after trying out stackr. In my own dumb way, I went to Stack Overflow, for example, and read about machine learning and related topics as far as my budget allows for internet access through my cell phone. Mostly I skimmed over the contents, and after getting tired, bored, or frustrated I would look for something I could do with my own hands, like some interesting R script to run related to what I had been reading. Or the process could be reversed: I would do a minimal reading of an interesting topic, find some analysis I could replicate, try it, and go on to search and read until I was through with it, or decided to stop. For yourselves, choose your own styles of working, or no style at all. But enjoy them all the same.
There is an introduction to stackr by its creator: Introducing stackr: An R package for querying the Stack Exchange API (David Robinson, Variance Explained, Feb. 4, 2015).
He noted:
The package is straightforward to use. Every function starts with stack_: stack_answers to query answers, stack_questions for questions, stack_users, stack_tags, and so on. Each output is a data frame, where each row represents one object (an answer, question, user, etc.). The package also provides features for sorting and filtering results in the API: almost all the features available in the API itself. Since the API has an upper limit of returning 100 results at a time, the package also handles pagination so you can get as many results as you need.
He went on to show how to analyze a single user's answering activity, including the mean number of accepted answers; how the answering activity changed over time, by counting the answers by month; and how it changed over the course of a day, at hourly intervals, with graphs of each. He also showed how to count the numbers of answers for different tags and make a bar chart as well as a word cloud out of that.
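A rough sketch of that kind of by-month count (not Robinson's original code; it assumes answers is a data frame returned by stackr with a date-time creation_date column) might look like:

```r
# Sketch: monthly answer counts as a bar chart, assuming `answers` has a
# POSIXct creation_date column, as stackr results are described to have.
library(dplyr)
library(ggplot2)

answers %>%
  mutate(month = format(creation_date, "%Y-%m")) %>%  # e.g. "2015-02"
  count(month) %>%
  ggplot(aes(month, n)) +
  geom_col() +
  labs(x = "Month", y = "Number of answers")
```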
However, for this post I've tried getting a similar kind of data to what we got out of Reddit using the RedditExtractoR package in our last post. In Reddit you can use search terms and then subreddits to restrict the search to particular grouping(s) of entries organized by areas of interest, called subreddits. In Stack Overflow, the item similar to a subreddit is a tag. For example, in Reddit you use subreddit = “machinelearning”, and in Stack Overflow you use tag = “machine-learning”.
Our Suu exercise plan would be
similar to that for Reddit:
- You use the stack_search() function with a tag, say “machine-learning”, and with the word “mining” in the title of the question.
- As with Reddit in our last post, we will ask for three pages of results, but accept the default page size (unlike RedditExtractoR, which returns a fixed page size of 25 entries, with stackr you can specify up to 100).
- In the Reddit exercise we looked at the most recent posts and chose one, “Google opens a Dedicated Machine Learning Research Center in Europe”, which is a news post; here in Stack Exchange an original post can only be a question. Say you would like to get the answers for the question with the largest number of views from the three retrieved pages. For that, you take the question_id for which the view_count is maximum, then get the text of the question, the answers to it, comments, etc. (However, I couldn't find a way to do that in stackr yet, so I have to do it in my own dumb way: using the browser by calling it from the R session.) Here's my script:
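A rough equivalent of that script (a reconstruction, not the original; it assumes stack_search() passes the API's tagged and intitle filters through, and that the results include view_count and link columns, as the API's question objects do) would be:

```r
# Reconstruction of the exercise: search by tag and title word, keep the
# most-viewed question, save the results, open the question in the browser.
library(stackr)

qs <- stack_search(tagged = "machine-learning", intitle = "mining",
                   num_pages = 3)  # default page size, three pages

# Question with the largest view_count among the retrieved results
top <- qs[which.max(qs$view_count), ]

# Save the results in both CSV and RData form
write.csv(qs, "ML_stackrQ.csv", row.names = FALSE)
save(qs, file = "ML_stackrQ.RData")

# Open the question page (with its answers and comments) in the browser
browseURL(top$link)
```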
After running the script you should see
a webpage like this:
You should also find these two files in your working directory: (i) ML_stackrQ.csv, and (ii) ML_stackrQ.RData. Running the script given above to get the results mentioned should give you no trouble at all, and neither should trying out different search terms and/or tags, or choosing which webpage to open to see the answers and comments, etc.
However, installing stackr may not be trouble-free, because this package is not available on the main CRAN repository for R packages, as the RedditExtractoR package was. It is available only from its GitHub repository here.
The screenshot above shows the stackr page on GitHub with the “Clone or download” button clicked. I chose to download the ZIP file and then followed the instructions for installing it. You'll find the instructions when you scroll down the page a little.
I chose the second option. If you do so, get yourself a good internet connection and be patient. Be prepared to have devtools and 20-plus other R packages installed on your computer first. They are required for installing stackr!
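That second, devtools-based route boils down to two commands (the repository name dgrtwo/stackr is the one shown on the GitHub page):

```r
# Install devtools from CRAN first (this pulls in its many dependencies),
# then install stackr straight from its GitHub repository.
install.packages("devtools")
devtools::install_github("dgrtwo/stackr")
```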
As I've seen, stackr may not be the one to get you the body of posts such as questions, answers, and comments (as yet?). It is perfect for getting the date they were posted, the date of last activity, the edit date, and the closed date; qualitative data like whether the question was answered, whether an answer was accepted, and whether the question has been closed by moderators and the reason for that; and quantitative data like view_count, answer_count, score, owner_reputation, and owner_accept_rate. Since your query returns these values and others in an R data frame, you should have no trouble doing analyses of your choice, including those shown by the stackr package's creator Robinson, linked earlier in this post.
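For instance, a few quick summaries of those columns (assuming qs is a data frame returned by stack_search() with the fields named above):

```r
# Sketch: quick looks at the metadata columns, assuming `qs` came from
# stack_search() and carries is_answered, view_count, score, answer_count.
mean(qs$is_answered)             # share of questions with an answer
summary(qs$view_count)           # spread of view counts
cor(qs$score, qs$answer_count)   # do higher-scored questions draw more answers?
```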
P.S.
Lesson:
Sly work supervisors start you off with small batches. So did I for the exercises on Reddit and Stack Overflow, with “Easier done than said” batches. Each has been designed to give myself and my fellow dummies an accelerated sense of achievement. I don't know if that could qualify as an application of Nudge theory, popular nowadays with the executive branches of the U.S. and Britain. (Bear with me, I'm showing off that I've heard of Nudge.)
Risk:
Dropouts.