Saturday, February 16, 2013

Google Statistician uses R and other programming tools

A great interview on the Simply Statistics blog with Google's Nick Chamandy, who holds a PhD in statistics. He explains that he mainly uses R, among other tools, for his work at Google. Also of note is the active data science community within Google, which uses R as well as some other interesting tools. Understandably, they use a lot of data at Google, and R usually cannot handle the size. They do a lot of ad hoc reduction of the data with tools like MapReduce, Go, and even an R API. I would love to see how they use the R API to assimilate data.
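To give a feel for what "reducing" data before analysis can look like, here is a toy map/reduce-style sketch in Python. It is not Google's tooling and not taken from the interview; the records, field names, and function names are all made up for illustration. The idea is simply to collapse raw records into small per-group summaries that a tool like R could then handle comfortably.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical raw records: (country, ad_clicks). Made up for illustration.
raw_records = [
    ("US", 3), ("CA", 1), ("US", 5), ("DE", 2), ("CA", 4), ("DE", 0),
]

def map_phase(record):
    """Emit a (key, (sum, count)) pair for one raw record."""
    country, clicks = record
    return country, (clicks, 1)

def reduce_phase(acc, pair):
    """Fold one mapped pair into the running per-key (sum, count)."""
    key, (clicks, n) = pair
    total, count = acc[key]
    acc[key] = (total + clicks, count + n)
    return acc

def summarize(records):
    """Reduce raw records to one small summary row per key."""
    mapped = map(map_phase, records)
    totals = reduce(reduce_phase, mapped, defaultdict(lambda: (0, 0)))
    return {k: {"n": c, "mean_clicks": t / c} for k, (t, c) in totals.items()}

summary = summarize(raw_records)
```

Here `summary` holds one entry per country instead of one per raw record, which is the kind of size reduction that makes the downstream statistical work tractable.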

An interesting insight from the interview is the amount of programming done by the statisticians. It seems the culture at Google is to foster autonomy and let the modelers develop their own data manipulation from the raw data. This requires a broader skill set than the statistical analysis tools alone.

I've found in my work that having knowledge of many tools like R, CPLEX, and GLPK allows me to be more effective. Recently I've been learning a lot of SQL using the PostgreSQL platform. The tools of SQL combined with statistical tools like R make for a very strong combination. I'm very agile in my work and can do a wide variety of decision analyses.
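As a rough illustration of that SQL-plus-statistics pattern, here is a minimal sketch. It makes two assumptions: Python's built-in sqlite3 module stands in for PostgreSQL so the example runs without a database server, and Python's statistics module stands in for R. The `orders` table and its values are hypothetical.

```python
import sqlite3
import statistics

# In-memory SQLite database standing in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("east", 14.0), ("west", 7.0), ("west", 9.0), ("west", 11.0)],
)

# Let SQL do the grouping and aggregation close to the data.
rows = conn.execute(
    "SELECT region, COUNT(*) AS n, AVG(amount) AS mean_amount "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()

# Pull only the slice needed, then hand it to the statistical tool.
west_amounts = [
    r[0] for r in conn.execute(
        "SELECT amount FROM orders WHERE region = ?", ("west",)
    )
]
west_sd = statistics.stdev(west_amounts)  # sample standard deviation

conn.close()
```

The split is the point: the database filters and aggregates, and the statistical layer only sees the small result it actually needs.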

4 comments:

Antonio Piccolboni said...
This comment has been removed by the author.
Antonio Piccolboni said...

The statement that R cannot handle large data sets is inaccurate. As hinted at in the answer, R users at Google have access to most of Google's computing infrastructure and have achieved Java parity in speed while offering a much more powerful language (http://www.amstat.org/meetings/jsm/2012/onlineprogram/AbstractDetails.cfm?abstractid=303783). Also, Java without Hadoop can't handle very large data sets. Why should Java augmented with a library still be called Java, and R with a package no longer R? The open source RHadoop project (https://github.com/RevolutionAnalytics/RHadoop), on which I am a developer, has a similar goal of empowering R users to analyze the largest data sets, not just aggregations thereof, and while not as mature as the internal Google project, it makes these capabilities available to the community.

Larry D'Agostino said...

Great comment, Antonio. I agree with you. R does have capabilities to analyze data at rather large scales. I recently presented on the ff package at our local R users group. I'm hoping to post a tutorial soon.

Electrical Cord said...

Can you provide me the link to the interview video? Meanwhile, I would like to know more about the tools of SQL combined with statistical tools like R.