Friday, June 24, 2016

Suu steam-rolling II

The stackr package is an R package that you can use to get data from the Stack Exchange family of websites. Well, the idea for a big part of my Bayanathi blog has been, and will be, to stick to the “Yan can cook” philosophy. The only difference from Yan is that his demonstrations are based on his famous cooking and years of demos; mine are based on seeing some nice recipes, followed by my first attempts to cook them! Currently I am trying out some “Suu” recipes so that my fellow dummies may “Tuu” and (granted that my dishes were too salty, oily, or smelly and somewhat burnt) improve on them to get to “Pyuu” finally.

Stack Exchange is a network of question and answer Web sites on topics in varied fields, each site covering a specific topic, where questions, answers, and users are subject to a reputation award process. The sites are modeled after Stack Overflow, a Q&A site for computer programming questions that was the original site in this network. The reputation system allows the sites to be self-moderating. (Stack Exchange, Wikipedia)

Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network. It features questions and answers on a wide range of topics in computer programming. As of April 2014, Stack Overflow has over 4,000,000 registered users and more than 10,000,000 questions, with 10,000,000 questions celebrated in late August 2015. (Stack Overflow, Wikipedia)

Since a stackr search by default takes your query to the Stack Overflow website, it would be good to read about it now, or later after trying out stackr. In my own dumb way, I went to Stack Overflow, for example, and read about machine learning and related topics as far as my budget allows for internet access through my cell phone. Mostly I skim over the contents, and after getting tired, bored, or frustrated I look for something that I can do with my own hands, like some interesting R script to run related to what I've been reading. Or the process may be reversed: I do a minimal reading of an interesting topic, find some analysis I can replicate, try it, and go on to search and read until I am through with it, or decide to stop. For yourselves, choose your styles of working, or no style at all. But enjoy them all the same.

So there is an introduction to stackr by its creator: Introducing stackr: An R package for querying the Stack Exchange API (David Robinson, Variance Explained, Feb. 4, 2015). He noted:

The package is straightforward to use. Every function starts with stack_: stack_answers to query answers, stack_questions for questions, stack_users, stack_tags, and so on. Each output is a data frame, where each row represents one object (an answer, question, user, etc). The package also provides features for sorting and filtering results in the API: almost all the features available in the API itself. Since the API has an upper limit of returning 100 results at a time, the package also handles pagination so you can get as many results as you need.

He went on to show how to analyze a single user's answering activity: the mean number of accepted answers, how the activity changed over time (by counting the answers by month), and how it varied over the course of a day at hourly intervals, with graphs for each. He also showed how to count the number of answers for different tags and make a bar chart as well as a wordcloud out of that.

However, for this post, I've tried getting the same kind of data that we got out of Reddit using the RedditExtractoR package in our last post. In Reddit you can use search terms and then subreddits to restrict the search to particular groupings of entries organized by areas of interest. In Stack Overflow, the item similar to a subreddit is a tag. For example, in Reddit you use subreddit = “machinelearning” and in Stack Overflow you use tag = “machine-learning”.

Our Suu exercise plan would be similar to that for Reddit:
  1. Use the stack_search() function for a tag, say “machine-learning”, with the word “mining” in the title of the question.
  2. As with Reddit in our last post, we will ask for three pages of results, accepting the default page size (unlike RedditExtractoR, which returns a fixed page size of 25 entries, stackr lets you specify up to 100).
  3. In the Reddit exercise we looked at the most recent posts and chose one: “Google opens a Dedicated Machine Learning Research Center in Europe”, which is a news post; here in Stack Exchange an original post can only be a question. Say you would like to get the answers for the question with the largest number of views from the three retrieved pages. For that, you take the question_id for which the view_count is maximum. Then get the text of the question, the answers to it, comments, etc. (However, I couldn't find a way to do that in stackr yet, so I have to do it in my own dumb way: using the browser by calling it from the R session.) Here's my script:
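
For readers who want something to type in, the three steps above can be sketched roughly as follows. This is a minimal sketch and not the script originally posted: the argument names intitle, tagged, and num_pages are my understanding of stackr and the Stack Exchange search API, the helper names fetch_questions, most_viewed, and question_url are made up for illustration, and the URL pattern is an assumption about how Stack Overflow addresses a question by its id. The network part runs only in an interactive session with stackr installed.

```r
# Step 1 and 2: search Stack Overflow (the stackr default site) for questions
# tagged "machine-learning" with "mining" in the title, three pages,
# default page size. Argument names are assumptions, see the note above.
fetch_questions <- function() {
  stackr::stack_search(intitle = "mining",
                       tagged = "machine-learning",
                       num_pages = 3)
}

# Step 3: keep the row whose view_count is largest (hypothetical helper).
most_viewed <- function(questions) {
  questions[which.max(questions$view_count), , drop = FALSE]
}

# Build a browser address from a question_id (hypothetical helper).
# format() avoids scientific notation for large numeric ids.
question_url <- function(question_id) {
  paste0("https://stackoverflow.com/questions/",
         format(question_id, scientific = FALSE, trim = TRUE))
}

# Only touch the network when run by hand with stackr available.
if (interactive() && requireNamespace("stackr", quietly = TRUE)) {
  questions <- fetch_questions()

  # Save the results under the two file names used in this post.
  write.csv(questions, "ML_stackrQ.csv", row.names = FALSE)
  save(questions, file = "ML_stackrQ.RData")

  # Open the most viewed question in the browser to read the answers.
  top <- most_viewed(questions)
  browseURL(question_url(top$question_id))
}
```

Keeping the selection and the URL building in small functions means you can try them on a hand-made data frame before spending any internet budget on a real query.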

After running the script you should see a webpage like this:

You should also find these two files in your working directory: (i) ML_stackrQ.csv, and (ii) ML_stackrQ.RData. Running the script to get these results should give you no trouble at all, and neither should trying out different search terms and/or tags, or choosing which webpage to open to see the answers, comments, etc.

However, installing stackr may not be trouble-free, because this package is not available on CRAN, the main repository for R packages, as the RedditExtractoR package was. It is available only from the GitHub repository here.

The screenshot above shows the stackr page on GitHub with the “Clone or download” button clicked. I chose to download the ZIP file and then followed the instructions for installing it. You'll find the instructions when you scroll down the page a little.

I chose the second option. If you do the same, get yourself a good internet connection and be patient. Be prepared to get devtools and 20-plus other R packages installed on your computer first. They are required for installing stackr!
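
For those who prefer to install straight from R, one common route is sketched below. It assumes the repository lives at dgrtwo/stackr on GitHub and uses devtools::install_github(); it is not necessarily the exact option described above, and install_stackr is just a name I made up for the wrapper. Nothing is installed unless you run it by hand in an interactive session.

```r
# Hypothetical wrapper: get devtools from CRAN if it is missing,
# then install stackr from its GitHub repository (assumed: dgrtwo/stackr).
install_stackr <- function() {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("dgrtwo/stackr")
}

# Run only by hand, and only if stackr is not there already.
if (interactive() && !requireNamespace("stackr", quietly = TRUE)) {
  install_stackr()
}
```

Expect the same long wait either way, since devtools brings its own dependencies with it.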

As far as I've seen, stackr may not be the one to get you the body of posts such as questions, answers, and comments (as yet?). It is perfect for getting the dates they were posted, last active, edited, or closed; qualitative data such as whether the question was answered, whether an answer was accepted, and whether the question was closed by moderators and the reason for that; and quantitative data like view_count, answer_count, score, owner_reputation, and owner_accept_rate. Since your query returns these values and others in an R data frame, you should have no trouble doing the analysis of your choice, including those shown by the stackr package creator Robinson in the post linked earlier.
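
As a small taste of such analysis, here is a sketch that summarizes a data frame of questions. The mock data frame stands in for a real query result so that it runs without internet access; the numbers in it are invented. The column names view_count and answer_count are the ones mentioned above, while is_answered is my assumption about how the "was it answered" flag is named, and summarise_questions is a made-up helper.

```r
# Hypothetical helper: a few one-line summaries of a questions data frame.
summarise_questions <- function(questions) {
  data.frame(
    n_questions    = nrow(questions),
    share_answered = mean(questions$is_answered),   # fraction of TRUE values
    median_views   = median(questions$view_count),
    mean_answers   = mean(questions$answer_count)
  )
}

# Invented stand-in for a real stackr result (four questions).
mock <- data.frame(
  is_answered  = c(TRUE, FALSE, TRUE, TRUE),
  view_count   = c(100, 40, 250, 60),
  answer_count = c(2, 0, 3, 1)
)

result <- summarise_questions(mock)
print(result)
```

With a real result you would pass the data frame returned by your query in place of mock.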

Sly work supervisors start you off with small batches.
So did I for exercises on Reddit and Stack Overflow with “Easier done than said” batches.
Each has been designed to give myself and my fellow dummies an accelerated sense of achievement. I don't know if that could qualify as an application of Nudge theory, popular nowadays with the executive branches of the U.S. and Britain. (Bear with me, I'm showing off that I've heard of Nudge.)

