Friday, December 30, 2011

Most popular 2011 IEOR Tools blog articles

Here are the most popular IEOR Tools blog articles of 2011.  It is time for reflection, and I like to do this every year.  It gives me perspective on what is being read and is an interesting look at our interests.  This year seems to be about our thirst for software tools and how to use them.  Books are also still popular as reference material.

  1. Open Source Replacements for Operations Research and Analytics Software
  2. R Tutorial: Add confidence intervals to a dot chart
  3. Science of Matchmaking
  4. Data Mining Books list
  5. Physicist cuts airplane boarding time in half
  6. R again in Google Summer of Code
  7. Moneyball coming to the big screen
  8. Baseball and Decision Analytics
  9. Question and Answer sites for Analytics and Operations Research
  10. Sports analytics summer blog reading recommendations

Wednesday, December 21, 2011

Visualizing categorical data in R

I came across an interesting SAS macro for visualizing log odds relationships in data.  This type of chart is helpful for visualizing the relationship between a binary dependent variable and a continuous independent variable.  I don't use SAS on a daily basis, as I prefer R, so I got to thinking that I could recreate this macro using only R.  I thought this would make a good R tutorial on developing functions, using different plot techniques, and overlaying chart types.

The following picture is the result of the logoddsFnc function in R.  The chart is close to the SAS original but not an exact match.  For the histogram points I used the default squares of the stripchart plot, in grey to make them look a little faded.


The following is the R script.

logoddsFnc <- function(data_ind, data_dep, ind_varname, min.count=1){

  # Assumptions: x & y are numeric vectors of the same
  # length, y is a 0/1 variable.  This returns a vector
  # of breaks of the x variable where each bin has at
  # least min.cnt events (y == 1)
  bin.by.other.count <- function(x, other, min.cnt=1) {
    csum <- cumsum(tapply(other, x, sum))
    breaks <- numeric(0)
 
    i <- 1
    breaks[i] <- as.numeric(names(csum)[1])
    cursum <- csum[1]
 
    for ( a in names(csum) ) {
      if ( csum[a] - cursum >= min.cnt ) {
        i <- i + 1
        breaks[i] <- as.numeric(a)
        cursum <- csum[a]
      }
    }
 
    breaks
  }
 
  brks <- bin.by.other.count(data_ind, data_dep, min.cnt=min.count)
 
  # Visualizing binary categorical data
  var_cut <- cut(data_ind, breaks=brks, include.lowest=T)
  var_mean <- tapply(data_dep, var_cut, mean)
  var_median <- tapply(data_ind, var_cut, median)
 
  mydf <- data.frame(ind=data_ind, dep=data_dep)
  fit <- glm(dep ~ ind, data=mydf, family=binomial())
  pred <- predict(fit, data.frame(ind=min(data_ind):max(data_ind)),
          type="response", se.fit=T)
 
  # Plot
  plot(x=var_median, y=var_mean, ylim=c(0,1.15),
       xlab=ind_varname, ylab="Exp Prob", pch=21, bg="black")
  stripchart(data_ind[data_dep==0], method="stack",
             at=0, add=T, col="grey")
  stripchart(data_ind[data_dep==1], method="stack",
             at=1, add=T, col="grey")
 
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit, col="blue", lwd=2)
  lines(lowess(x=var_median,
               y=var_mean, f=.30), col="red")
 
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit - 1.96*pred$se.fit, lty=2, col="blue")
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit + 1.96*pred$se.fit, lty=2, col="blue")
}




 


logoddsFnc(icu$age, icu$died, "age", min.count=3)



The ICU data for this example can be found in the R package "vcdExtra".  Special thanks to David of Univ. of Dallas for providing a way to develop breaks in the independent variable, as seen in the bin.by.other.count function.

The author of the SAS macro, M. Friendly, also wrote Visualizing Categorical Data, which is a great reference for analyzing and visualizing data in factored groups.
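
For readers who want to see the idea behind the chart on the log odds scale without installing vcdExtra, here is a minimal sketch.  It is not the function above: it uses simulated data (the age range, coefficients, and sample size are made up for illustration) and simple decile bins from quantile() instead of bin.by.other.count.  It compares the empirical log odds in each bin with the log odds from a logistic regression.

```r
# Simulated stand-in for the ICU data (hypothetical values, not the real data set)
set.seed(42)
n   <- 500
age <- sample(20:90, n, replace = TRUE)

# Assumed model for the simulation: log odds rise linearly with age
died <- rbinom(n, size = 1, prob = plogis(-4 + 0.05 * age))

# Bin the predictor into groups of roughly equal size (decile breaks)
brks <- unique(quantile(age, probs = seq(0, 1, by = 0.1)))
bins <- cut(age, breaks = brks, include.lowest = TRUE)

# Per-bin summaries: median age, observed rate, empirical log odds
bin_median <- tapply(age, bins, median)
bin_rate   <- tapply(died, bins, mean)
bin_logit  <- qlogis(bin_rate)  # log(p / (1 - p)); infinite if a bin rate is 0 or 1

# Logistic regression on the raw 0/1 data, evaluated at the bin medians
fit       <- glm(died ~ age, family = binomial())
fit_logit <- predict(fit, data.frame(age = as.numeric(bin_median)), type = "link")

# Side-by-side comparison of empirical and fitted log odds
result <- data.frame(median_age      = as.numeric(bin_median),
                     observed_rate   = as.numeric(bin_rate),
                     empirical_logit = as.numeric(bin_logit),
                     fitted_logit    = as.numeric(fit_logit))
print(round(result, 3))
```

If the logistic model is a reasonable fit, the two logit columns should roughly track each other.  Large gaps in particular bins hint at nonlinearity, which is the kind of pattern the lowess line in the chart is meant to reveal.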





Thursday, December 15, 2011

OpenOpt Suite 0.37

Hi all,
I'm glad to inform you about the new release 0.37 (2011-Dec-15) of our free software:

OpenOpt (numerical optimization):

  • IPOPT initialization time gap (time till first iteration) for FuncDesigner models has been decreased
  • Some improvements and bugfixes for interalg, especially for "search all SNLE solutions" mode (Systems of Non Linear Equations)
  • Eigenvalue problems (EIG) (in both OpenOpt and FuncDesigner)
  • Equality constraints for GLP (global) solver de
  • Some changes for goldenSection ftol stop criterion

FuncDesigner:

  • Major sparse automatic differentiation improvements for badly vectorized or unvectorized problems with lots of constraints (except box bounds); some problems now run many times or even orders of magnitude faster (of course not faster than vectorized problems with an insufficient number of variable arrays). It is recommended to retest your large-scale problems with useSparse = 'auto' | True | False
  • Two new methods for splines to check their quality: plot and residual
  • Solving ODE dy/dt = f(t) with specifiable accuracy by interalg
  • Speedup for solving 1-dimensional IP by interalg

SpaceFuncs and DerApproximator:

  • Some code cleanup

You can follow OpenOpt development through our recently created Twitter and Facebook pages; see http://openopt.org for details.

See also: FuturePlans, this release announcement in OpenOpt forum

Regards, Dmitrey.

Expanded list of online courses for data analysis

The folks at Stanford have been really busy putting together online curriculum for the world to learn.  This is a followup to a previous post Machine Learning for everyone.  Stanford has included a bunch of other courses that they will promote online.

Some of the interesting courses are
  • Model Thinking
  • Natural Language Processing
  • Game Theory
  • Design and Analysis of Algorithms 
Apart from the promoted online courses from Stanford, there are also other courses of interest that are not promoted but still offer access to lectures and notes.  Notable courses from the Stanford School of Engineering include the following.
  • Introduction to Linear Dynamical Systems
  • Convex Optimization I
  • Convex Optimization II
Stanford isn't the only school that is promoting its lectures online for free use.  A lot of schools are promoting open learning and collaboration through what is called Open Courseware.  Some notable schools included.
As an analytics professional for many years, I've found that honing your skills is very important for your career.  It is now easier than ever to do, with schools opening up their classes for everyone.  I strongly recommend picking areas of expertise that you are passionate about or want to learn, and finding the schools that offer them online.

Thursday, November 10, 2011

My learning as a Data Scientist

So apparently the new en vogue title is Data Scientist.  I can now add that to my already expanding list of titles.  In the past I've been known as an Engineer, Operations Analyst, Production Control Specialist, and an Analytics Analyst.  Now I'm considered a Data Scientist.  It's all the same to me.  My training and expertise have allowed me to solve many problems within organizations.  The title doesn't matter.  There are opportunities for people with my skill set.

A recent blog post by Kontagent Kaleidoscope, Big Data is Useless without Science, got me thinking about my role as a self-proclaimed Data Scientist.  The article points out a need for the science of better decision making.  Organizations are looking for people to help them turn their data mines into information gold.  I've definitely learned a lot over the years as a Data Scientist, and I thought I would list some of those lessons.

1.  Organizations Don't Know What a Data Scientist Can Do

The idea here is marketing your own talents.  The Data Scientist needs to put their methods and work out there for the organization to see and touch.  This means working with the peers and management in the organization.  The Data Scientist needs to be able to eloquently relate methods, problems, challenges and how they can be solved.  Important skills here are personal marketing and communication.  I know this goes against the grain of many numbers geeks like me.

2.  Problems Don't Solve Themselves

Opportunities for solving real problems in an organization are always around.  The trick is being in the right place at the right time to be able to solve those problems.  Organizations have hoarded a lot of data and many times they don't even remember why.  The Data Scientist needs to turn into a Data Detective.  Explore all aspects of the organization.  Interview different departments and see how they tick and ask questions like "What keeps you up at night about your job?".  I was often surprised how a simple solution would go a long way to helping someone else out.  This develops true collaboration and leads to bigger problems to solve.

3.  Always Continue to Learn New Things

The world is constantly evolving and there are always new tools, tricks, methods, algorithms, software and mechanisms.  The Data Scientist needs to be able to adapt to new technologies.  I've found it's best to stay current with what's new in order to stay sharp and meet new demands.  The internet can be your friend.  Even keeping up with a favorite list of blogs can help with staying current.  Times change and so do organizations' needs.  Perhaps this is just me, but I love learning new things as it creates a fun diversion and improves my skill set.



Tuesday, October 11, 2011

Top 50 Statistics blogs of 2011

TheBestColleges.org published their list of the top 50 statistics blogs.  This is a really good resource list of statistical analysis and news.

Monday, October 3, 2011

Data mining the Federal Reserve

The Federal Reserve now has the ability to have its data programmatically retrieved.  The St. Louis Fed Web Services allows programmers and data scientists to retrieve key economic data from their libraries.  I have not had a chance to peruse the site at all but this can be a really interesting source of data.  The age of Open Data is really upon us.  This can lead to some really interesting research for professionals and amateur scientists.
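
As a rough sketch of what programmatic retrieval could look like from R, the following builds a request URL using base R only.  It assumes the series/observations endpoint and the series_id, api_key, and file_type parameters of the St. Louis Fed (FRED) web service; these should be checked against the current documentation.  The helper name fred_observations_url is hypothetical, UNRATE is just an example series, and the key shown is a placeholder, since a real key has to be requested from the St. Louis Fed.

```r
# Hypothetical helper: build a request URL for FRED series observations.
# Endpoint and parameter names are assumptions; verify them against the
# St. Louis Fed API documentation before relying on this.
fred_observations_url <- function(series_id, api_key,
                                  base_url = "https://api.stlouisfed.org/fred/series/observations") {
  params <- c(series_id = series_id, api_key = api_key, file_type = "json")
  # Percent-encode each value so special characters are safe in a query string
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base_url, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

# Placeholder key for illustration only; it will not authenticate
url <- fred_observations_url("UNRATE",
                             api_key = "0123456789abcdef0123456789abcdef")
print(url)

# With a valid key and the jsonlite package installed, the observations
# could then be read with something like (not run here):
# obs <- jsonlite::fromJSON(url)$observations
```

Keeping the URL construction in its own small function makes it easy to swap in a different series or to add further query parameters later.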