Thursday, February 2, 2012

R graphic used for Facebook IPO

Apparently a graphic of the Facebook social network graph by former Facebook intern Paul Butler is being used for Facebook's IPO.  The graphic is featured on page 7 of the IPO filing.  It was also featured on Mashable and R-bloggers not too long ago.  The graphic shows Facebook connections between city centers around the world.  Paul used an ingenious combination of color transparency and great circle arcs to display the social network graph.

This is just one of the really cool things you can do with R.  R was used not only as the visual medium but also to calculate the great circle paths.  It is really neat to see R in such a high-profile setting.  If you want to learn more about R, you can read an IEORTools post about R links for beginners on World Statistics Day.  There are also many books on R programming that you can buy at the IEORTools Online Store.
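
To give a feel for the two ideas mentioned above, here is a minimal base R sketch of a great circle arc drawn with a semi-transparent color.  It is not Paul Butler's code.  The function name great_circle_path, the city coordinates, and the choice of spherical interpolation on unit vectors are all assumptions made for illustration.

```r
# Intermediate points along the great circle between two lon/lat points
# (in degrees), using spherical linear interpolation on unit vectors.
# great_circle_path is a hypothetical helper, not part of any package.
# Note: the arc is undefined for exactly antipodal points.
great_circle_path <- function(lon1, lat1, lon2, lat2, n = 50) {
  to_rad <- pi / 180
  # Convert lon/lat to a 3-D unit vector
  to_xyz <- function(lon, lat) {
    c(cos(lat * to_rad) * cos(lon * to_rad),
      cos(lat * to_rad) * sin(lon * to_rad),
      sin(lat * to_rad))
  }
  p1 <- to_xyz(lon1, lat1)
  p2 <- to_xyz(lon2, lat2)
  # Central angle between the two points
  omega <- acos(max(-1, min(1, sum(p1 * p2))))
  if (omega < 1e-12) {  # identical points: nothing to interpolate
    return(data.frame(lon = rep(lon1, n), lat = rep(lat1, n)))
  }
  t <- seq(0, 1, length.out = n)
  # Each row of pts is one interpolated unit vector
  pts <- (outer(sin((1 - t) * omega), p1) + outer(sin(t * omega), p2)) / sin(omega)
  data.frame(lon = atan2(pts[, 2], pts[, 1]) / to_rad,
             lat = asin(pmax(-1, pmin(1, pts[, 3]))) / to_rad)
}

# Approximate coordinates for San Francisco and London (illustrative values)
arc <- great_circle_path(-122.4, 37.8, -0.1, 51.5)

# Draw the arc with a semi-transparent blue; overlapping arcs would
# appear brighter where many connections coincide
plot(arc$lon, arc$lat, type = "n", xlab = "Longitude", ylab = "Latitude")
lines(arc$lon, arc$lat, col = rgb(0, 0, 1, alpha = 0.3), lwd = 2)
```

With many city pairs, calling lines() once per pair with a low alpha value lets heavily connected regions stand out, which is the effect the transparency is meant to achieve.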

Thursday, January 12, 2012

Should science be open?

Two interesting articles about technology and science appeared this week in some blogs I frequent.  The first is an Op-ed in the New York Times titled Research Bought, Then Paid For and the next is Open Science: why is it so hard?  The two articles offer different takes on the idea that scientific findings should be open for everyone.  Someone who is outside the scientific community might think that statement is silly.  Of course science is open.  No one has a copyright or a monopoly on scientific or mathematical discoveries.  Yet that is not the real issue.  The real issue is the access to those scientific discoveries.  In some cases the scientific discoveries are paid for by public subsidies.

The main focus of those two articles is that science has been hijacked by the publishers.  The articles even go so far as saying the hijacking is a monopoly of sorts.  I think monopoly is too strong an analogy but the publishers do have a lot of control.  The control is mostly about access to the science.  The publishers own the copyright and can deny access to anyone unless a fee is paid.  A lot of the time these fees are rather high.  Now it looks like the Research Works Act will limit access to publicly funded scientific research as well.  Access to the science is the crux of the debate.

Academics rely on publishing their scientific findings for further funding of their research.  It is part of the academic circle of life.  Publishing begets more funding which begets more publishing and the cycle continues.  I do believe the academic community deserves to get compensated for their research.  I'm not sure how much residual income they get from their published content, other than recognition from their peers.  Publishers seem, again, to have a lot of the control.

I am not an academic researcher.  My work is trying to help organizations better themselves by using the learning, skills, and knowledge I have acquired through the years as an Operations Research professional.  I try to keep up to date on the latest research and methods by studying journals, networking with colleagues, and reading articles.  I rely on scientific access quite a bit in staying up to date with the latest findings.  I rely on the academic community so I can improve my knowledge and skills.  Yet it seems very difficult for me to gain access to a lot of good research.  There has to be a common ground for access to the science.  I wish I had a simple solution to this issue but it seems very large and very complicated.  There are a lot of interactions that I am sure I am glossing over.  Yet I am a big fan of the idea of Open Science.

There are some publishers that do understand this problem.  INFORMS seems to get this issue rather well.  They do not charge a lot for their journals.  In fact as part of membership INFORMS allows two free subscriptions to any journal of your choosing.  In addition to that the PubsOnLine Suite is available for $99 which is 12 journals for a whole year.  That is a bargain compared to some other publishers.  So not all publishers are pure evil.  There are some good ones.

Monday, January 2, 2012

IEORTools.com Resources added

I've decided to spruce up my personal website IEORTools.com.  I want to add some additional resources to it along with the book store.  Most of the content will be reference links relevant to Industrial Engineering and Operations Research professionals.

The first thing I did was add a Resources side menu.  The Resources side menu will link to relevant resource sections.  So far I have created the following resources
These links are a collection of resources that I have accumulated over the years.  The links are a great reference and hopefully I can build them up more.  I'm going to be creating more content on the ieortools.com site as opposed to the blog because I'm just running out of room.

Friday, December 30, 2011

Most popular 2011 IEOR Tools blog articles

Here are the most popular IEOR Tools blog articles of 2011.  It is time for reflection and I like to do this every year.  It gives me perspective about what is being read.  It is also an interesting look at our interests.  This year seems to be about our thirst for software tools and how to use them.  Books are also still big as reference materials.

  1. Open Source Replacements for Operations Research and Analytics Software
  2. R Tutorial: Add confidence intervals to a dot chart
  3. Science of Matchmaking
  4. Data Mining Books list
  5. Physicist cuts airplane boarding time in half
  6. R again in Google Summer of Code
  7. Moneyball coming to the big screen
  8. Baseball and Decision Analytics
  9. Question and Answer sites for Analytics and Operations Research
  10. Sports analytics summer blog reading recommendations

Wednesday, December 21, 2011

Visualizing categorical data in R

I came across an interesting SAS macro that was used for visualizing log odds relationships of data.  This type of chart is helpful for visualizing the relationship between a binary dependent variable and a continuous independent variable.  I don't use SAS on a daily basis as I prefer to use R.  So I got to thinking that I could recreate this macro using only R.  I thought this would be a good tutorial for R on developing functions, using different plot techniques, and overlapping chart types.

The following picture is the result of the logodds function in R.  The chart is really close but not quite exact.  For the histogram points I decided to use the default squares of the stripchart plot and used a grey color to make it look a little faded.


The following is the R script.

logoddsFnc <- function(data_ind, data_dep, ind_varname, min.count=1){

  # Assumptions: x & y are numeric vectors of the same
  # length, y is a 0/1 variable.  This returns a vector
  # of breaks of the x variable where each bin has at
  # least min.cnt number of y's
  bin.by.other.count <- function(x, other, min.cnt=1) {
    csum <- cumsum(tapply(other, x, sum))
    breaks <- numeric(0)
 
    i <- 1
    breaks[i] <- as.numeric(names(csum)[1])
    cursum <- csum[1]
 
    for ( a in names(csum) ) {
      if ( csum[a] - cursum >= min.cnt ) {
        i <- i + 1
        breaks[i] <- as.numeric(a)
        cursum <- csum[a]
      }
    }
 
    breaks
  }
 
  brks <- bin.by.other.count(data_ind, data_dep, min.cnt=min.count)
 
  # Visualizing binary categorical data
  var_cut <- cut(data_ind, breaks=brks, include.lowest=TRUE)
  var_mean <- tapply(data_dep, var_cut, mean)
  var_median <- tapply(data_ind, var_cut, median)
 
  mydf <- data.frame(ind=data_ind, dep=data_dep)
  fit <- glm(dep ~ ind, data=mydf, family=binomial())
  pred <- predict(fit, data.frame(ind=min(data_ind):max(data_ind)),
          type="response", se.fit=TRUE)
 
  # Plot
  plot(x=var_median, y=var_mean, ylim=c(0,1.15),
       xlab=ind_varname, ylab="Exp Prob", pch=21, bg="black")
  stripchart(data_ind[data_dep==0], method="stack",
             at=0, add=TRUE, col="grey")
  stripchart(data_ind[data_dep==1], method="stack",
             at=1, add=TRUE, col="grey")
 
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit, col="blue", lwd=2)
  lines(lowess(x=var_median,
               y=var_mean, f=.30), col="red")
 
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit - 1.96*pred$se.fit, lty=2, col="blue")
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit + 1.96*pred$se.fit, lty=2, col="blue")
}




 


logoddsFnc(icu$age, icu$died, "age", min.count=3)



The ICU data for this example can be found in the R package "vcdExtra".  Special thanks to David of Univ. of Dallas for providing me with a way to develop breaks in the independent variable, as seen in the bin.by.other.count function.
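
If you do not have the ICU data handy, the idea behind the chart can be tried on simulated data.  The following is only a rough sketch under stated assumptions: the data are made up, the variable names are hypothetical, and the bins are simple deciles rather than the count-based breaks produced by bin.by.other.count.

```r
set.seed(42)

# Simulated data (an assumption for illustration; this is not the ICU data)
n    <- 500
age  <- sample(20:90, n, replace = TRUE)
p    <- plogis(-4 + 0.05 * age)          # true probability rises with age
died <- rbinom(n, size = 1, prob = p)    # binary 0/1 outcome

# Bin the continuous variable into deciles (a simpler rule than
# bin.by.other.count) and summarise each bin
brks     <- unique(quantile(age, probs = seq(0, 1, by = 0.1)))
age_bin  <- cut(age, breaks = brks, include.lowest = TRUE)
bin_rate <- tapply(died, age_bin, mean)    # observed proportion of 1s per bin
bin_mid  <- tapply(age, age_bin, median)   # representative x value per bin

# Fit the logistic regression and predict over the observed range
fit  <- glm(died ~ age, family = binomial())
grid <- data.frame(age = min(age):max(age))
pred <- predict(fit, newdata = grid, type = "response")

# Binned proportions as points, fitted probability as a curve
plot(bin_mid, bin_rate, ylim = c(0, 1), pch = 21, bg = "black",
     xlab = "age", ylab = "Proportion / fitted probability")
lines(grid$age, pred, col = "blue", lwd = 2)
```

If the binned points fall close to the blue curve, the logistic model is describing the relationship reasonably well, which is the same visual check the full chart provides.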

The author of the SAS macro, M. Friendly, is also the author of Visualizing Categorical Data, which is a great reference for analyzing and visualizing data in factored groups.





Thursday, December 15, 2011

OpenOpt Suite 0.37

Hi all,
I'm glad to inform you about the new release 0.37 (2011-Dec-15) of our free software:

OpenOpt (numerical optimization):

  • IPOPT initialization time gap (time till first iteration) for FuncDesigner models has been decreased
  • Some improvements and bugfixes for interalg, especially for "search all SNLE solutions" mode (Systems of Non Linear Equations)
  • Eigenvalue problems (EIG) (in both OpenOpt and FuncDesigner)
  • Equality constraints for GLP (global) solver de
  • Some changes for goldenSection ftol stop criterion

FuncDesigner:

  • Major sparse automatic differentiation improvements for badly-vectorized or unvectorized problems with lots of constraints (except for box bounds); some problems now run many times or even orders of magnitude faster (of course not faster than vectorized problems with insufficient number of variable arrays). It is recommended to retest your large-scale problems with useSparse = 'auto' | True | False
  • Two new methods for splines to check their quality: plot and residual
  • Solving ODE dy/dt = f(t) with specifiable accuracy by interalg
  • Speedup for solving 1-dimensional IP by interalg

SpaceFuncs and DerApproximator:

  • Some code cleanup

You may trace OpenOpt development information in our recently created entries in Twitter and Facebook, see http://openopt.org for details.

See also: FuturePlans, this release announcement in OpenOpt forum

Regards, Dmitrey.

Expanded list of online courses for data analysis

The folks at Stanford have been really busy putting together online curriculum for the world to learn.  This is a followup to a previous post Machine Learning for everyone.  Stanford has included a bunch of other courses that they will promote online.

Some of the interesting courses are
  • Model Thinking
  • Natural Language Processing
  • Game Theory
  • Design and Analysis of Algorithms 
Apart from the promoted online courses from Stanford, there are also other courses of interest that are not promoted but still offer access to lectures and notes.  Notable courses from the Stanford School of Engineering include the following.
  • Introduction to Linear Dynamical Systems
  • Convex Optimization I
  • Convex Optimization II
Stanford isn't the only school that is promoting their lectures online for free use.  A lot of schools are promoting open learning and collaboration through what is called Open Courseware.  Some notable schools included.
As an analytics professional for many years, I've found honing your skills to be very important for your career.  It is now easier than ever to do, with schools opening up their classes for everyone.  I strongly recommend picking areas of expertise that you are passionate about or want to learn and finding the schools that promote them online.