For my previous post “DIY ethics or Ethics express”, I used the text mining software
package “tm” to build a wordcloud from Google search results.
Text mining is a particular class of data mining, which
Wikipedia defines as:
… the
computational process of discovering patterns in large data sets
involving methods at the intersection of artificial intelligence,
machine learning, statistics, and database systems. ...The overall
goal of the data mining process is to extract information from a data
set and transform it into an understandable structure for further
use.
To be a bit more informed, I looked
around on the web and found pieces like these from the Cross
Validated and Stack Overflow websites:
I found similar questions/answers on
Quora, like the following with 69 answers:
What is the difference between Data Analytics, Data Analysis, Data Mining,
Data Science, Machine Learning, and Big Data?
TIL (Today I Learned …, borrowing from Reddit) that a recent post
(May 9, 2016) on the R-bloggers website, originally from Sharp Sight
Labs, shares an expert's opinion on “What’s the difference between
machine learning, statistics, and data mining?”
Here Wasserman, a professor in both the Statistics and Machine
Learning Departments at “Carnegie Mellon,
one of the premier universities for stats and ML”, declared
that there is no difference. In his own words:
The short answer is: None. They are … concerned with the same question:
how do we learn from data?
Further, the blog post gives reasons in
support of Wasserman's characterization, and the
subtitles in the body of that post give me a neat
(or actually, lazy) way to summarize it:
The core similarities between ML, stats, and data mining
Nearly identical subject-matter and toolkits
The three cultures: why there are three identical subjects with different names
The core differences between ML, stats, and data mining
They emphasize different things
Machine learning is focused on software and systems
The purpose of data mining is finding patterns in databases
They use different words and terminology
ML, stats, and data mining tend to favor different tools
ML and data mining typically work on “bigger” data than statistics
Again: there are far more similarities than differences
Even then, we fellow dummies may
benefit from more varied reading in this age of steam-rollers like
big data, machine learning, data science and data revolution
threatening to flatten a traditional discipline like statistics.
Here, I have to confess that however much I tried to follow the
discussions, controversies and links provided on the pages of
Stack Overflow, Cross Validated, Quora and others, a lot of them
were, for me, TL;DR (Too Long; Didn't Read, again borrowing
from the Redditors).
Speaking of Reddit, it is “an
entertainment, social news networking service, and news website”.
It had 542 million monthly visitors (234 million unique users) and
ranked as the 14th most visited website in the US and 36th in the world
in 2015, according to Wikipedia. Reddit has quite extensive posts and
comments on Machine Learning in its “MachineLearning” subreddit as
well as in related subreddits. Subreddits are the different areas
of interest into which Reddit entries are organized. The
following screenshot shows part of the first page of the
MachineLearning subreddit,
with threads in the “top” category.
Learning Machine-Learning (or
for that matter Data Mining, or Statistics, or …) is obviously
easier said than done. Yet we may approach any kind of
learning through a three-step process: စု-တု-ပြု (suu-tuu-pyuu),
or accumulate-imitate-create, as we Myanmars used to
say. Turning our conventional wisdom upside down, I would now
suggest that, at least for the accumulation step, it could be easier
done than said! In that context, I've started out by trying to
give myself a DIY glimpse of the exciting world of getting data from
web services, social networks and others through APIs (application
programming interfaces). Now I'm sharing this: an API for the
masses, of some sort.
What we do now is get the posts on
machine learning from the Reddit website using the R package
“RedditExtractoR”. We assume that you know how to use the R
statistical environment and have the RedditExtractoR package ver.
2.0.2, the latest version, installed together with the two required
packages, RJSONIO and igraph. Try running this script
as I did:
By the end of the run you should find
the following files in your working directory:
ML_getRed.csv
ML_contRed.csv
ML_get_cont.RData
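The original script is not reproduced here. As a rough sketch only, a run that leaves files with these names behind could look like the following. It assumes the get_reddit(), reddit_content() and construct_graph() functions of RedditExtractoR 2.0.2; the search term, the choice of the first thread, and the URL column name are my own assumptions and not taken from the original run.

```r
library(RedditExtractoR)  # version 2.0.2 assumed

# Assumed search: threads in the MachineLearning subreddit that mention
# "machine learning". The settings used for the original run are not known.
ML_getRed <- get_reddit(search_terms = "machine learning",
                        subreddit = "MachineLearning",
                        page_threshold = 1)
write.csv(ML_getRed, "ML_getRed.csv", row.names = FALSE)

# Comments of a single thread. The URL column name is an assumption about
# the data frame returned by get_reddit(); the first thread is picked
# purely for illustration.
thread_url <- unique(ML_getRed$URL)[1]
ML_contRed <- reddit_content(thread_url)
write.csv(ML_contRed, "ML_contRed.csv", row.names = FALSE)

# Keep both data frames together in one R data file
save(ML_getRed, ML_contRed, file = "ML_get_cont.RData")

# Draw the comment graph for that thread
construct_graph(ML_contRed, plot = TRUE)
```

The calls go out to the Reddit website, so the run needs an internet connection and may take a while because the package pauses between requests.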
If you have saved or printed the graph
produced by the run to a file, you should get the following image of
the graph of comments for the thread “Google opens a Dedicated
Machine Learning Research Center in Europe” by the OP (original
poster, another Reddit term) jay_jay_man posted
on June 16, 2016.
You can see that I hadn't made the
graph readable and pretty. That would require a bit of tweaking of the
plotting parameters, and the RedditExtractoR help pages say nothing
about getting it done. So, like every wise instructor, I would just
say that it is left as an exercise for the learner!
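For learners who want a head start on that exercise, here is a small sketch of the kind of plotting parameters igraph offers. It uses a made-up toy comment tree, not the real thread, and the file name ML_graph_toy.pdf is only an example. If construct_graph() returns an igraph object, which I have not confirmed, the same arguments could presumably be passed to plot() for the real graph.

```r
library(igraph)

# Toy comment ids in a "1", "1_1", "1_2_1" nesting style; these are made up
# for illustration and are not taken from the real thread.
ids <- c("1", "1_1", "1_2", "1_2_1", "2", "2_1", "3")

# A reply's parent is its id with the last "_n" removed; top-level comments
# hang off the original post, labelled "OP" here.
parents <- ifelse(grepl("_", ids), sub("_[0-9]+$", "", ids), "OP")
edges <- data.frame(from = parents, to = ids, stringsAsFactors = FALSE)

g <- graph_from_data_frame(edges, directed = TRUE)

# Tree layout rooted at the post, with smaller vertices, labels and arrows
root_id <- which(V(g)$name == "OP")
pdf("ML_graph_toy.pdf", width = 10, height = 7)
plot(g,
     layout = layout_as_tree(g, root = root_id),
     vertex.size = 12,
     vertex.label.cex = 0.9,
     vertex.color = "lightblue",
     edge.arrow.size = 0.4)
dev.off()
```

The tree layout is a design choice: a comment thread is a hierarchy, so placing the post at the top and replies below it tends to be easier to read than a force-directed layout.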
Anyway, the idea of the graph is great
and the retrieval of data is super, though we still miss some minor
features and options in RedditExtractoR that are only available by
visiting the Reddit website directly.
Postscript: We certainly hope to
see some public or private or NGO/CSO or multi/bi-lateral agency
websites in Myanmar offering data services through web APIs sometime
soon. My fellow dummies, be ready! Not within the first hundred days,
though.