Tuesday, June 28, 2016

Suu steam-rolling III: aRxiv for arXiv


I once knew a bright young nuclear scientist, trained locally, who went abroad a long time ago to learn about handling nuclear waste. Coming from a rural family of farmers, he told me that he had quite some trouble using a tea bag properly at his first breakfast in the hotel. That reminded me of watching a gentleman from Sri Lanka make tea during a conversation. It was some five or six years earlier, when I was in the Pacific. There I learned that the secret of good tea is to add one more spoonful of creamer (for the cup) to the two spoonfuls I had added for myself! Yet, despite the time I have spent outside Myanmar, I wouldn't be able to set the table myself, because I don't know the proper places for knives, forks, or spoons, and I'm surely ignorant of which brand of tea should go with which occasion. It's my karma, and in no way could I have attended a classy high school or had a proper upbringing, if you like. But I have no regrets.

For that matter, I have never owned a car or learned to drive one. I'm just happy that I learned to use computers and smartphones a bit. I am happy that I can share things with ordinary folks, and that is what matters.

Then there is one website I would like to share. I've heard that it could replace some technical journals that you can't read for free. Welcome open access. Welcome arXiv.


According to ProgrammableWeb:

The Cornell University e-print arXiv, hosted at arXiv.org, is a document submission and retrieval system used by the physics, mathematics and computer science communities. It has become the primary means of communicating manuscripts on current and ongoing research. The arXiv repository is available worldwide. Manuscripts are often submitted to the arXiv before they are published by more traditional means. In some cases they may never be submitted or published elsewhere. The purpose of the arXiv API is to allow programmatic access to the arXiv's e-print content and metadata.

aRxiv is an R interface to the arXiv API. The arXiv API does not require an API key. You can install aRxiv from any of the CRAN mirrors.

Below is my R script for finding papers on machine learning published on arXiv since the beginning of 2010, and for viewing their abstracts on the arXiv website. Once on the abstract page, you can download the full paper by using the link provided. Alternatively, you can save the search results for the arXiv papers to a text file. Then you can conveniently view them with a spreadsheet.
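For readers who want something to paste into the console, here is a minimal sketch of such a search. It is not the script from this post. The query text, the date window, the limit of 50 records, and the helper name build_query are my assumptions for illustration. Only the output file name ML_aRxSrch.csv comes from the post. The calls arxiv_count() and arxiv_search() are from the aRxiv package, and write.csv() is from base R.

```r
# A sketch under stated assumptions, not the original script.
# build_query() is a hypothetical helper; the phrase and dates are assumed.

# Build an arXiv API query: a phrase in the abstract, within a submission
# window. Dates use the API's YYYYMMDDHHMM form.
build_query <- function(phrase, from, to) {
  sprintf('abs:"%s" AND submittedDate:[%s TO %s]', phrase, from, to)
}

ml_query <- build_query("machine learning", "201001010000", "201606282359")

# The search needs the network, so run it only in an interactive session
# where the aRxiv package is installed.
if (interactive() && requireNamespace("aRxiv", quietly = TRUE)) {
  print(aRxiv::arxiv_count(ml_query))               # number of matching records
  hits <- aRxiv::arxiv_search(ml_query, limit = 50) # data frame, one row per paper
  write.csv(hits, "ML_aRxSrch.csv", row.names = FALSE)
}
```

Checking the count first is a cheap way to see how large the result set is before deciding on a limit. The saved file can then be opened in any spreadsheet program.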


When you run the script, you will see your commands, their outputs, and any error messages displayed one after another on the R console. I assume you know how to save them to a text file for reference. Near the end of the script you'll see the arxiv_open() function. It opens the abstract page shown below, where you can see the link for downloading the PDF file of the full paper:
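As a rough illustration of that last step, the sketch below opens the abstract pages for the first few hits. The helper first_abstract_links, the query, and the choice of three pages are assumptions of mine. arxiv_open() and the link_abstract column of the search results come from the aRxiv package.

```r
# A sketch, assuming a search result data frame as returned by arxiv_search().
# first_abstract_links() is a hypothetical helper for previewing the URLs.
first_abstract_links <- function(results, n = 3) {
  head(results$link_abstract, n)
}

# Opening pages needs the network and a browser, so run this only in an
# interactive session where the aRxiv package is installed.
if (interactive() && requireNamespace("aRxiv", quietly = TRUE)) {
  hits <- aRxiv::arxiv_search('abs:"machine learning"', limit = 5)
  print(first_abstract_links(hits))   # the URLs about to be opened
  aRxiv::arxiv_open(head(hits, 3))    # opens each abstract page in the browser
}
```

Passing only a few rows to arxiv_open() keeps the browser from being flooded with tabs.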


After the run, you should also find the ML_aRxSrch.csv file you've saved in your working directory. Part of it is shown below, opened with the OpenOffice Calc spreadsheet. Some rows were hidden for convenience of display:



My purpose here and in some earlier posts is to introduce (to myself as well as others) some tools that could take advantage of the web APIs provided by publishers, Q/A sites, social media, and others. For simplicity I've been leaving the contents of their results untouched. However, I couldn't help feeling inspired by an instance of the struggle for ever greater understanding of older vs. newer ideas as I read this abstract:

Neuroimaging research has predominantly drawn conclusions based on classical statistics, including null-hypothesis testing, t-tests, and ANOVA. Throughout recent years, statistical learning methods enjoy increasing popularity, including cross-validation, pattern classification, and sparsity-inducing regression. These two methodological families used for neuroimaging data analysis can be viewed as two extremes of a continuum. Yet, they originated from different historical contexts, build on different theories, rest on different assumptions, evaluate different outcome metrics, and permit different conclusions. This paper portrays commonalities and differences between classical statistics and statistical learning with their relation to neuroimaging research. The conceptual implications are illustrated in three common analysis scenarios. It is thus tried to resolve possible confusion between classical hypothesis testing and data-guided model estimation by discussing their ramifications for the neuroimaging access to neurobiology.

Despite my complete ignorance of neuroimaging or neurobiology, I sensed the promise that such improvements in knowledge would benefit our well-being somehow, collectively or individually. On the side, my feeling is that reading such abstracts could be something more than an idle recreation for the non-specialist. The bottom line is that it may broaden our knowledge base or simply leave us with an appreciation of good things done (provided, of course, that we can make any sense of a given abstract).


Back to the basics: to play around with the aRxiv package, a good start is to read the vignette “aRxiv tutorial” in the help pages of “Html help”, accessible from your R console. You could also download this tutorial here.
