Sunday, June 19, 2016

Suu steam-rolling


For my previous post, “DIY ethics or Ethics express”, I used the text mining software package “tm” to build a word cloud from Google search results. Text mining is a particular class of data mining, which Wikipedia defines as:

the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. ...The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

To be a bit more informed, I looked around on the web and found pieces like these from the Cross Validated and Stack Overflow websites:

I found similar questions and answers on Quora, like the following one with 69 answers:


TIL (Today I Learned …, borrowing from Reddit) that a recent post (May 9, 2016) on the R-bloggers website, originally from Sharp Sight Labs, shares an expert's opinion on “What’s the difference between machine learning, statistics, and data mining?” Here Wasserman, described in the post as a professor in both the Statistics and Machine Learning Departments at Carnegie Mellon, “one of the premier universities for stats and ML”, declared that there is no difference. In his own words:

The short answer is: None. They are … concerned with the same question: how do we learn from data?

The blog post goes on to give reasons in support of Wasserman's characterization, and the following subtitles from the body of that post give me a neat (or actually, lazy) way to summarize them.

The core similarities between ML, stats, and data mining
Nearly identical subject-matter and toolkits
The three cultures: why there are three identical subjects with different names

The core differences between ML, stats, and data mining
They emphasize different things
Machine learning is focused on software and systems
The purpose of data mining is finding patterns in databases
They use different words and terminology
ML, stats, and data mining tend to favor different tools
ML and data mining typically work on “bigger” data than statistics

Again: there are far more similarities than differences

Even so, we fellow dummies may benefit from more varied reading in this age of steam-rollers like big data, machine learning, data science and the data revolution, which threaten to flatten a traditional discipline like statistics. Here I have to confess that, however much I tried to follow the discussions, controversies and links on the pages of Stack Overflow, Cross Validated, Quora and others, a lot of them were, for me, TL;DR (Too Long; Didn't Read, again borrowing from the Redditors).

Speaking of Reddit, it is “an entertainment, social news networking service, and news website”. According to Wikipedia, in 2015 it had 542 million monthly visitors (234 million unique users) and ranked as the 14th most visited website in the US and the 36th in the world. Reddit has quite extensive posts and comments on machine learning in its “MachineLearning” subreddit as well as in related subreddits. Subreddits are the different areas of interest into which Reddit entries are organized. The following screenshot shows part of the first page of the MachineLearning subreddit, with threads in the “top” category.



Learning machine learning (or, for that matter, data mining, or statistics, or …) is obviously easier said than done. Yet we may approach any kind of learning through a three-step process: စု-တု-ပြု (suu-tuu-pyuu), or accumulate-imitate-create, as we in Myanmar used to say. Turning our conventional wisdom upside down, I would now suggest that, at least for the accumulation step, it could be easier done than said! In that context, I've started out by trying to give myself a DIY glimpse of the exciting world of getting data from web services, social networks and others through an API (application programming interface). Now I'm sharing this: an API for the masses, of some sort.

What we do now is get the posts on machine learning from the Reddit website using the R package “RedditExtractoR”. We assume that you know how to use the R statistical environment and have the RedditExtractoR package ver. 2.0.2, the latest version at the time of writing, installed together with the two required packages, RJSONIO and igraph. Try running this script as I did:


By the end of the run you should find the following files in your working directory:

ML_getRed.csv
ML_contRed.csv
ML_get_cont.RData



If you have saved or printed the graph produced by the run to a file, you should get the following image: the graph of comments for the thread “Google opens a Dedicated Machine Learning Research Center in Europe”, posted on June 16, 2016 by the OP (original poster, another Reddit term) jay_jay_man.


You can see that I haven't made the graph readable and pretty. That would require a bit of tweaking of the plotting parameters, and the RedditExtractoR help pages say nothing about how to get it done. So, like every wise instructor, I will just say that it is left as an exercise for the learner!

Anyway, the idea of the graph is great and the retrieval of data is super, though RedditExtractoR still lacks some minor features and options that are only available by visiting the Reddit website directly.


Postscript: We certainly hope to see some public, private, NGO/CSO or multilateral/bilateral agency websites in Myanmar offering data services through web APIs sometime soon. My fellow dummies, be ready! Not within the first hundred days, though.
