Friday, December 30, 2011

Most popular 2011 IEOR Tools blog articles

Here are the most popular IEOR Tools blog articles of 2011.  It is time for reflection, and I like to do this every year.  It gives me perspective on what is being read, and it is also an interesting look at our shared interests.  This year seems to have been about our thirst for software tools and how to use them.  Books are also still big as reference materials.

  1. Open Source Replacements for Operations Research and Analytics Software
  2. R Tutorial: Add confidence intervals to a dot chart
  3. Science of Matchmaking
  4. Data Mining Books list
  5. Physicist cuts airplane boarding time in half
  6. R again in Google Summer of Code
  7. Moneyball coming to the big screen
  8. Baseball and Decision Analytics
  9. Question and Answer sites for Analytics and Operations Research
  10. Sports analytics summer blog reading recommendations

Wednesday, December 21, 2011

Visualizing categorical data in R

I came across an interesting SAS macro that was used for visualizing log odds relationships in data.  This type of chart is helpful for visualizing the relationship between a binary dependent variable and a continuous independent variable.  I don't use SAS on a daily basis, as I prefer to use R, so I got to thinking that I could recreate this macro using only R.  I thought this would make a good R tutorial on developing functions, using different plot techniques, and overlaying chart types.

The following picture is the result of the logodds function in R.  The chart is really close to the SAS original but not quite exact.  For the histogram points I decided to use the default squares of the stripchart plot and used a grey color to make them look a little faded.


The following is the R script.

logoddsFnc <- function(data_ind, data_dep, ind_varname, min.count=1){

  # Assumptions: x & y are numeric vectors of the same
  # length and y is a 0/1 variable.  This returns a vector
  # of breaks of the x variable where each bin has at
  # least min.cnt y's equal to 1
  bin.by.other.count <- function(x, other, min.cnt=1) {
    csum <- cumsum(tapply(other, x, sum))
    breaks <- numeric(0)
 
    i <- 1
    breaks[i] <- as.numeric(names(csum)[1])
    cursum <- csum[1]
 
    for ( a in names(csum) ) {
      if ( csum[a] - cursum >= min.cnt ) {
        i <- i + 1
        breaks[i] <- as.numeric(a)
        cursum <- csum[a]
      }
    }
 
    breaks
  }
 
  brks <- bin.by.other.count(data_ind, data_dep, min.cnt=min.count)
 
  # Visualizing binary categorical data
  var_cut <- cut(data_ind, breaks=brks, include.lowest=T)
  var_mean <- tapply(data_dep, var_cut, mean)
  var_median <- tapply(data_ind, var_cut, median)
 
  mydf <- data.frame(ind=data_ind, dep=data_dep)
  fit <- glm(dep ~ ind, data=mydf, family=binomial())
  pred <- predict(fit, data.frame(ind=min(data_ind):max(data_ind)),
          type="response", se.fit=T)
 
  # Plot
  plot(x=var_median, y=var_mean, ylim=c(0,1.15),
       xlab=ind_varname, ylab="Exp Prob", pch=21, bg="black")
  stripchart(data_ind[data_dep==0], method="stack",
             at=0, add=T, col="grey")
  stripchart(data_ind[data_dep==1], method="stack",
             at=1, add=T, col="grey")
 
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit, col="blue", lwd=2)
  lines(lowess(x=var_median,
               y=var_mean, f=.30), col="red")
 
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit - 1.96*pred$se.fit, lty=2, col="blue")
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit + 1.96*pred$se.fit, lty=2, col="blue")
}

logoddsFnc(icu$age, icu$died, "age", min.count=3)



The ICU data for this example can be found in the R package "vcdExtra".  Special thanks to David of the Univ. of Dallas for providing me with a way to develop breaks in the independent variable, as seen in the bin.by.other.count function.
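
For completeness, here is one way to load and prepare the data before calling the function.  This is only a minimal sketch: it assumes the ICU data set from "vcdExtra" and that the died column is stored as a yes/no factor, which gets recoded to 0/1 first.

library(vcdExtra)          # assumed installed; provides the ICU data set
data(ICU)

# Recode the outcome to 0/1 if it is stored as a factor
# (an assumption about the data layout; adjust for your version of the data)
icu <- ICU
icu$died <- as.numeric(icu$died == "Yes")

logoddsFnc(icu$age, icu$died, "age", min.count=3)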

The author of the SAS macro, M. Friendly, is also the author of Visualizing Categorical Data, which is a great reference for analyzing and visualizing data in factored groups.

Thursday, December 15, 2011

OpenOpt Suite 0.37

Hi all,
I'm glad to inform you about new release 0.37 (2011-Dec-15) of our free software:

OpenOpt (numerical optimization):

  • IPOPT initialization time gap (time till first iteration) for FuncDesigner models has been decreased
  • Some improvements and bugfixes for interalg, especially for "search all SNLE solutions" mode (Systems of Non Linear Equations)
  • Eigenvalue problems (EIG) (in both OpenOpt and FuncDesigner)
  • Equality constraints for the GLP (global) solver "de"
  • Some changes for goldenSection ftol stop criterion

FuncDesigner:

  • Major sparse automatic differentiation improvements for badly vectorized or unvectorized problems with lots of constraints (except for box bounds); some problems now run many times, or even orders of magnitude, faster (of course still not faster than properly vectorized problems).  It is recommended to retest your large-scale problems with useSparse = 'auto' | True | False
  • Two new methods for splines to check their quality: plot and residual
  • Solving ODE dy/dt = f(t) with specifiable accuracy by interalg
  • Speedup for solving 1-dimensional IP by interalg

SpaceFuncs and DerApproximator:

  • Some code cleanup

You can follow OpenOpt development news through our recently created Twitter and Facebook accounts; see http://openopt.org for details.

See also: FuturePlans, this release announcement in OpenOpt forum

Regards, Dmitrey.

Expanded list of online courses for data analysis

The folks at Stanford have been really busy putting together online curricula for the world to learn from.  This is a followup to a previous post, Machine Learning for everyone.  Stanford has added a bunch of other courses that they will promote online.

Some of the interesting courses are
  • Model Thinking
  • Natural Language Processing
  • Game Theory
  • Design and Analysis of Algorithms 
Apart from the promoted online courses from Stanford, there are also other courses of interest that are not promoted but still provide access to lectures and notes.  Notable courses from the Stanford School of Engineering include the following.
  • Introduction to Linear Dynamical Systems
  • Convex Optimization I
  • Convex Optimization II
Stanford isn't the only school promoting its lectures online for free use.  A lot of schools are promoting open learning and collaboration through what is called Open Courseware, and several notable schools participate.
As an analytics professional for many years, I've found honing your skills to be very important for your career.  Now more than ever it is easy to do, with schools opening up their classes to everyone.  I strongly recommend identifying areas of expertise that you are passionate about or want to learn and finding the schools that promote them online.

Thursday, November 10, 2011

My learning as a Data Scientist

So apparently the new en vogue title is Data Scientist.  I can now add that to my already expanding list of titles.  In the past I've been known as an Engineer, Operations Analyst, Production Control Specialist, and Analytics Analyst.  Now I'm considered a Data Scientist.  It's all the same to me.  My training and expertise have allowed me to solve many challenges within organizations.  The title doesn't matter.  There are opportunities for people with my skill set.

A recent blog post by Kontagent Kaleidoscope, Big Data is Useless without Science, got me thinking about my role as a self-proclaimed Data Scientist.  The article points out a need for the science of better decision making.  Organizations are looking for people to help them turn their data mines into information gold.  I've definitely learned a lot over the years as a Data Scientist, and I thought I would list some of those lessons.

1.  Organizations Don't Know What a Data Scientist Can Do

The idea here is marketing your own talents.  The Data Scientist needs to put their methods and work out there for the organization to see and touch.  This means working with peers and management in the organization.  The Data Scientist needs to be able to eloquently relate methods, problems, and challenges, and how they can be solved.  Important skills here are personal marketing and communication.  I know this goes against the grain for many numbers geeks like me.

2.  Problems Don't Solve Themselves

Opportunities for solving real problems in an organization are always around.  The trick is being in the right place at the right time to be able to solve those problems.  Organizations have hoarded a lot of data and many times they don't even remember why.  The Data Scientist needs to turn into a Data Detective.  Explore all aspects of the organization.  Interview different departments and see how they tick and ask questions like "What keeps you up at night about your job?".  I was often surprised how a simple solution would go a long way to helping someone else out.  This develops true collaboration and leads to bigger problems to solve.

3.  Always Continue to Learn New Things

The world is constantly evolving and there are always new tools, tricks, methods, algorithms, software and mechanisms.  The Data Scientist needs to be able to adapt to new technologies.  I've found it's best to stay current with what's new in order to stay sharp and meet new demands.  The internet can be your friend.  Even keeping up with a favorite list of blogs can help with staying current.  Times change and so do organizations' needs.  Perhaps this is just me, but I love learning new things as it creates a fun diversion and improves my skill set.



Tuesday, October 11, 2011

Top 50 Statistics blogs of 2011

TheBestColleges.org published their list of the top 50 statistics blogs.  This is a really good resource list of statistical analysis and news.

Monday, October 3, 2011

Data mining the Federal Reserve

The Federal Reserve's data can now be retrieved programmatically.  The St. Louis Fed Web Services allow programmers and data scientists to retrieve key economic data from its libraries.  I have not had a chance to peruse the site yet, but this can be a really interesting source of data.  The age of Open Data is really upon us.  This can lead to some really interesting research for professionals and amateur scientists alike.
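
As a quick illustration, here is one way to pull a series from the St. Louis Fed (FRED) into R.  This is a minimal sketch that assumes the quantmod package, which provides a FRED data source; the symbol "UNRATE" (the civilian unemployment rate) is just an example series.

library(quantmod)                # assumed installed

# Download the U.S. unemployment rate series directly from FRED
getSymbols("UNRATE", src = "FRED")

head(UNRATE)                     # monthly observations as an xts object
plot(UNRATE, main = "U.S. Unemployment Rate (FRED: UNRATE)")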

Monday, September 26, 2011

Machine Learning for everyone

Well, maybe mostly everyone.  Have you been interested in gaining knowledge in the latest craze in artificial intelligence and computing?  Then look no further than Stanford's Machine Learning course, which now has open enrollment for everyone!  Andrew Ng is back to provide the world with knowledge about Machine Learning for the masses.

Per Stanford's website, Machine Learning covers data mining and statistical pattern recognition.  Mostly it is applying mathematical and statistical methods to draw out patterns of behavior from data sources.  So do you want to invent the next Netflix, Amazon, or Google?  This is the course for you.

If you do not want to enroll in the Machine Learning class, you could always watch some of the older lectures online.  Andrew Ng provides plenty of information from past lectures along with student-contributed projects.  The CS 229 website is worth a look for a bunch of Machine Learning related resources.

Friday, September 23, 2011

Data Driven Success in Professional Baseball

An interesting article from Data Center Knowledge covers the presentation Paul DePodesta gave at the Strata Summit.  Paul DePodesta is known for bringing mathematical and analytical know-how to Billy Beane and the Oakland Athletics major league baseball team.  His story was recounted by Michael Lewis in the book "Moneyball" and is portrayed in the film of the same name opening this weekend.

I really liked this quote from Paul in the article.

We didn’t solve baseball. But we reduced the inefficiency of our decision making.
Is that not the sort of thing that an analytics professional or an Operations Researcher ultimately tries to do?  Operations Research is not the art of creating anything new.  It is the art of making existing things better.  All decision making is inefficient to some degree.  Even the right decision can be inefficient on some level.  Decisions are full of balancing acts between constraints and feasibility.

This also proves that no industry or organization is without a need for efficient decision making.  Even baseball can use a dose of improved decision analysis, whether it is scheduling the league or determining the best value pitcher.  Sports analytics has definitely come into its own.  I'm eager to watch Paul's career and see whether analytics takes it to the next level.

Thursday, September 15, 2011

OpenOpt Suite 0.36

New release of the free BSD-licensed software OpenOpt Suite is out:

OpenOpt:

* Now solver interalg can handle all types of constraints and integration problems

* Some minor improvements and code cleanup

FuncDesigner:

* Interval analysis now can involve min, max and 1-d monotone splines R -> R of 1st and 3rd order

* Some bugfixes and improvements

SpaceFuncs:

* Some minor changes

DerApproximator:

* Some improvements for obtaining derivatives at points in R^n where the left or right derivative for a variable is absent, especially for stencil > 1

See http://openopt.org for more details.


Wednesday, August 31, 2011

Physicist cuts airplane boarding time in half

I have always been fascinated with the airplane boarding problem.  I wish I were in the airline industry, because I would love to tackle this problem.  I used to travel a lot for my job, and I would marvel at how long it took to board an airliner.  My first inclination is to redesign the plane (and the airport jetway) to include exits at the middle and rear of the plane to go along with the forward exit.  Yet I never put my ideas to paper and tried to calculate the efficiency gains.  There have been a lot of ideas that try to find the optimal boarding arrangement.  Would you believe that random boarding, i.e. Southwest Airlines, is a better boarding procedure than the current row-block assignment method?

Yet a physicist from Fermilab, Jason Steffen, did have some interesting ideas to improve existing airplane boarding procedures.  By using Monte Carlo simulations to measure efficiency and test his ideas, he was able to cut airplane boarding time by as much as half.  From the article, his method was to board window-seat sections first while alternating rows so passengers would not interfere with each other.

This is a very clever idea.  Yet I found one flaw that may not have been accounted for in his study.  I've noticed that overhead space is at a premium for passengers, especially for business travelers.  Business travelers often bring two carry-on bags.  These bags tend to fill up the overhead bins rather quickly.  When the overhead bins fill up, passengers have to search the aisles looking for available space for their bags.  This creates a bottleneck, and queues develop for the other boarding passengers.  It seems to me that Jason's study makes the assumption that all overhead bins would be available at the time of boarding.  If alternating rows are used in his model, then overhead bins might become filled to capacity before passengers board and create more bottlenecks.  It's just one theory that would be worth investigating before Jason's procedures are implemented.

I applaud Dr. Steffen's studies and findings on the airplane boarding problem.  It is a fascinating problem, as most of us have encountered airplane boarding from time to time.  For more information on his methods, you can read about Jason's airplane boarding work, which is very fascinating, on his website.

Monday, July 25, 2011

What did we learn from the Space Shuttle program

The NASA Space Shuttle program ended this past week with STS-135.  I still remember watching the shuttle launches as a boy.  They filled my head with dreams of space exploration and new discoveries.  As I grew older I took an interest in engineering and studied to become an engineer at university.  There I discovered the enormity of the engineering marvel that was the Space Shuttle program.

Discover published an article this week on what it calls the debacle of the Space Shuttle program.  Amos Zeeberg makes a lot of good and interesting points in this article.  The Space Shuttle was originally designed to be a cost-effective way of getting people and technology into space.  The program definitely did not deliver on that promise or projection.  Also, the Space Shuttle was considered to have a risk of failure of only 1 in 100,000.  I don't know if that was ever remotely true.  As we unfortunately know, the observed failure rate was 2 in 135.  Space travel is risky no matter how it is done.
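
As a quick aside, the gap between those two numbers is easy to quantify.  The following is my own back-of-the-envelope calculation in R (not from the article): an exact binomial 95% confidence interval for the per-flight failure probability, given 2 failures in 135 flights.

# Exact binomial 95% confidence interval for the failure probability
binom.test(x = 2, n = 135)$conf.int
# The resulting interval sits far above 1 in 100,000 (0.00001)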

So from an engineer's point of view, albeit one who was not involved with the space program, what can we really learn from the Shuttle program?  I believe that by applying Industrial Engineering and Operations Research principles we can come to some conclusions.  I don't personally think the Shuttle missions were a total debacle.  As an Engineer, there is always something to learn, even from a failure.  Edison said it best: he didn't fail 1,000 times trying to develop a light bulb; rather, he learned 1,000 ways not to build one.

Firstly, risk needs to be measured from both a micro and a macro perspective.  There are many systems that can lead to failure, and each system has a life all of its own.  The risk could be as simple as an O-ring or as complicated as a practical study of landing on the Moon.  Risks can be measured and weighed from different perspectives of time, cost, and quality of delivery on a promise.  When all risks are measured, then perspective can be put in place as to the delivery of that promise.  Perhaps the Shuttle program didn't deliver on all its promises.  Yet it did prove many things and showed that reusable vehicles were ahead of their time.  We can learn a lot from the Shuttle program about examining the risks of a promise and making sure that we evaluate different objectives and goals.

Secondly, engineering and management should have a cultivated relationship in which each understands the other's strengths and goals.  Engineering has the design in its best interest.  Management has the mission in its best interest.  The design and the mission are unique and have their own sets of goals.  Yes, there are going to be risks weighed in both the design and the mission.  The complexity comes when merging the risks of the design and the mission together.  The magnitude of the NASA Space Shuttle program magnified the relationship between engineering and management.  The best and the worst were brought to light.  The engineering marvel of creating a reusable vehicle is magnificent.  The managerial feat of sending people into space with a reusable vehicle on more than 100 missions is not insignificant.  The importance of merging design and mission together was a great learning experience from the Space Shuttle program, and we have already seen the fruits of that learning in missions to Mars and beyond the Solar System.

The NASA Shuttle Program was not an outright debacle.  There was a lot to learn from the process.  No it did not deliver on all initial expectations.  Yet it did deliver on this young boy's dreams of discovery and knowledge.  Once an Engineer, always an Engineer.  I hope that we will never cease to learn and improve from our failures.

Monday, July 4, 2011

Problems with data visualizations followup

In a recent post I showed a bad data visualization chart from The Economist.  While reading Slashdot I found a similar article about bad visualizations from BP and GE, as presented on Stephen Few's blog.  No doubt a lot of people share the same frustration.

Now that we are in the Insight Age, it seems that we will continually question and interpret how data is presented to us.  We are now data rich but knowledge poor.  I believe there are going to be vast new opportunities to help disseminate the data, and perhaps even new ways to help visualize the data as well.

I strongly suggest reading Stephen Few's blog.  It is an interesting read on how data visualization can be used poorly.  He even shows examples on how to do it correctly.

Thursday, June 30, 2011

How not to do data visualization

I was glancing over Hacker News and came across an article from the Economist Daily Chart blog.  The daily chart was about nations' debt management.  The chart is shown here.

Seems innocent enough.  It shows, in declining order, the debt per nation.  Wait a second.  Why does Ireland have more debt than the USA?  After reading the article more thoroughly, it looks like it is a percentage of GDP.  Wait, a third time?  Is the bar graph the percentage of GDP, or is the number in the white box a percentage of GDP?  And how does this relate to debt management?  So apparently the article explains the change in primary balance required for each nation to bring debt to 60% of GDP.  So the bar graph is a % change of GDP needed to get to 60% of GDP.  Are we crystal?  I'm not sure I totally understand, but that is my basic understanding.

Data visualization is important in Analytics and Operations Research.  We need to model real-world applications quite a lot.  Oftentimes there is no better way to do this than to use a chart or graph.  The real art is conveying the crux of the message to the recipient.  There is an internet meme devoted to the art of bad chart making.  I feel bad using the Economist as an example because, after all, I did finally (I think) come away with the right idea.  But still, notice how there are no data or axis labels across the top of the chart.  Also, the numbers in the white boxes are not given any units.  I'm still not sure if those numbers in the white boxes are a percentage or a debt value.  Sometimes the visual art clutters the real message.  It is important to make sure that the recipient has the right frame of reference and can understand each graphic and label.

Sunday, June 26, 2011

Recommended Machine Learning blogs

I happened upon a question on the OSQA site metaoptimize about recommended Machine Learning blogs.  There were a lot of blogs listed there that I had not seen before, so it really got me interested.

Good Machine Learning blogs

Machine Learning is the scientific process of developing algorithms that let computers evolve behaviors based on empirical data.  For instance, one may develop a decision tree that helps predict a certain behavior from a data set.  The decision tree itself is just a method to predict behavior.  Yet perhaps more data can be acquired and more behaviors observed.  Then the decision tree is fit again on the newer data (perhaps combined with the older).  New behaviors are learned from the newer data, and a new version of the decision tree evolves.  This process becomes algorithmic and continues.
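
Here is a minimal sketch of that refit-as-data-arrives loop, assuming the rpart package that ships with R.  The split of the built-in iris data into an "old" and a "new" batch is purely illustrative.

library(rpart)    # recursive partitioning (decision) trees, included with R

set.seed(1)
idx <- sample(nrow(iris), 100)

# Pretend these 100 rows are the data we had originally
old_data <- iris[idx, ]
fit1 <- rpart(Species ~ ., data = old_data)

# More data arrives; refit the tree on the combined data set
new_data <- iris[-idx, ]
fit2 <- rpart(Species ~ ., data = rbind(old_data, new_data))

# The evolved tree can now be used to predict behavior on new observations
table(predicted = predict(fit2, iris, type = "class"), actual = iris$Species)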

Machine Learning developed out of the field of Artificial Intelligence.  The idea of having computers learn has been around for as long as computers themselves.  Machine Learning is really starting to develop as computing power has caught up to the theory.  Machine Learning has a lot of uses and may be behind some of your favorite computer applications, such as product recommendation systems like Amazon or Netflix and search engines like Google or Bing.  Machine Learning is seeing practical uses in many places, and it's only just scratching the surface.

Monday, June 20, 2011

Moneyball coming to the big screen

Recently I found out that the book Moneyball by Michael Lewis is being made into a motion picture.  The Moneyball trailer can be viewed online.  In case you have never heard of Michael Lewis, you might have heard of the movie "The Blind Side", for which he also wrote the accompanying book.  The book Moneyball is about the Oakland Athletics and how they used analytics and mathematical know-how to turn around a professional baseball franchise.

The story centers on Billy Beane, who is played by Brad Pitt in the movie.  Billy Beane is a professional ballplayer turned General Manager.  Billy Beane inherits the top organizational management job for the losing Oakland Athletics.  He is immediately frustrated with the same old losing ways and believes he needs to shake up the system.  He finds out about the curious world of baseball analytics, otherwise known as sabermetrics, and hires a curious crew of young, mathematically gifted folks.

The story is fascinating even if you are not a fan of baseball.  The use of mathematics to help make business decisions is nothing new.  Yet applying this analytics approach to an industry that is deep-rooted in old ways and practices is intriguing.  Changing the ways of the "good ole boy" network requires risk, knowledge, and sometimes good fortune.  This can translate to almost any industry or organization.  I am most definitely looking forward to seeing this movie.

Thursday, June 16, 2011

OpenOpt Suite 0.34

I'm glad to inform you about the new quarterly release 0.34 of our free OOSuite software (OpenOpt, FuncDesigner, SpaceFuncs, DerApproximator).

Main changes:

* Python 3 compatibility

* Lots of improvements and speedup for interval calculations

* Now interalg can obtain all solutions of a nonlinear equation (example) or systems of them (example) in the given box lb_i <= x_i <= ub_i (bounds can be very large), possibly constrained (e.g. sin(x) + cos(y+x) > 0.5 or [sin(i*x) + y/i < i for i in range(100)])

* Many other improvements and speedup for interalg

Regards, D.

Monday, June 13, 2011

Analytics geeks win NBA championships

The Dallas Mavericks won their franchise's first NBA title.  They won their first championship by beating teams that everyone thought they could not beat.  The Mavericks were able to beat juggernauts like the Los Angeles Lakers, a fast-paced Portland Trail Blazers team, the up-and-coming young superstars of the Oklahoma City Thunder, and of course the Big Three of the Miami Heat.  As good as the Mavericks were at executing on the basketball court, they were equally good at executing a between-the-ears approach to basketball.  The Mavericks were able to win games by studying the numbers of professional basketball.  Some of the champions on the Mavericks team may not be able to hit even 10% of their three-pointers, but they sure know how to analyze a winning combination.

The analytics culture starts with Dallas Mavericks owner Mark Cuban.  According to ESPN, when Mark Cuban was looking for a coach, he studied games and found out that Rick Carlisle used the most efficient lineups most frequently.  Mark Cuban hiring Rick Carlisle to coach the Mavericks was a no-brainer, because the numbers do not lie.  As for Rick Carlisle, he is known for being a very cerebral coach and very handy with crunching NBA statistics as well.

Another known fact about the Dallas Mavericks is that they use an analytics staff to gain a competitive edge.  Most recently they have retained the NBA analytics stat guru Roland Beech of 82games.com.  In the past they had used the services of Wayne Winston, an Operations Research professor, to help analyze their lineups to be more competitive.

Mark Cuban gives a lot of credit to the geeks for the Mavericks' win.  From the ESPN article:

I give a lot of credit to Coach Carlisle for putting Roland on the bench and interfacing with him, and making sure we understood exactly what was going on. Knowing what lineups work, what the issues were in terms of play calls and training.

That is a lot of brainpower on the bench in every game.  It is good to see the geeks get their due.  Way to go, Mavericks, and I am looking forward to seeing what the geeks put on the court next season!

Monday, May 23, 2011

Sports analytics summer blog reading recommendations

The dog days of summer are almost here, and if you are a sports fan they can feel long.  Only baseball and soccer endure the summer season in the U.S.  Even if you are a die-hard baseball or soccer fan, the season itself can seem to last forever.  Now is the perfect time to get caught up in the analytics of your favorite spectator sports.  The following are some of my favorite sports analytics blogs and reading materials.

Baseball
FanGraphs.com
FanGraphs is the all-everything baseball numbers website.  The thing FanGraphs is best known for is having a complete database of baseball player metrics.  One of my favorite metrics in baseball is WAR, or Wins Above Replacement.  If that is not enough, they even have heat maps of strike-zone pitching locations.  Tracking your favorite team has never been more analytically exciting.

Football
AdvancedNFLstats.com
Advanced NFL Stats is the best NFL analytics blog out there right now.  Similar to FanGraphs, there is a complete database of NFL offense and defense metrics.  Advanced NFL Stats also does a good job of explaining the numbers behind the measurements.  Football is not an easy sport in which to analyze team and player performance, and this site does an excellent job of both.  Advanced NFL Stats also keeps a database of play-by-play data.

Drive-by-Football
The up-and-comer of the NFL analytics blogs is Drive-By Football.  Drive-By does a great job of explaining some of the harder math around determining team and player efficiency.  One of the most interesting features is the Markov Chain Drive calculator, which calculates the likelihood of scoring scenarios drive by drive, hence the name of the website.
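
To give a flavor of the Markov chain idea, here is a toy absorbing-chain calculation in R with made-up numbers (my own illustration, not Drive-By Football's model): transient states are two field-position zones, absorbing states are "Score" and "LostBall", and the matrix formula (I - Q)^-1 R gives the probability of each drive outcome.

# Transition probabilities among transient states (hypothetical values)
Q <- matrix(c(0.3, 0.4,     # OwnHalf -> OwnHalf, OppHalf
              0.1, 0.4),    # OppHalf -> OwnHalf, OppHalf
            nrow = 2, byrow = TRUE,
            dimnames = list(c("OwnHalf", "OppHalf"), c("OwnHalf", "OppHalf")))

# Transition probabilities into absorbing states (hypothetical values)
R <- matrix(c(0.05, 0.25,   # OwnHalf -> Score, LostBall
              0.35, 0.15),  # OppHalf -> Score, LostBall
            nrow = 2, byrow = TRUE,
            dimnames = list(c("OwnHalf", "OppHalf"), c("Score", "LostBall")))

# Probability of eventually reaching each absorbing state from each start
solve(diag(2) - Q) %*% R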

Basketball
Wayne Winston blog
This blog's primary focus is on basketball, specifically the NBA.  Wayne Winston is definitely known as a prolific Operations Research professor.  What you may not know is that Wayne Winston consulted for the Dallas Mavericks and other sports teams to help improve their franchises.  Wayne talks about other sports from time to time as well.  If you have not read Wayne Winston's book Mathletics: How Gamblers, Managers, and Sports Enthusiasts Use Mathematics in Baseball, Basketball, and Football, you are in for an analytical treat.  Wayne analyzes the why and how of measuring professional sports efficiency and winning.

Tuesday, May 17, 2011

In Memoriam of Dr. Paul Jensen

I received discouraging news last week that we lost a colleague in the Operations Research and INFORMS community.  Dr. Paul Jensen passed away peacefully on April 4, 2011.  Dr. Jensen served for a number of years at the Univ. of Texas at Austin and was a great contributor and researcher in the Operations Research community.  As recently as 2007, Dr. Jensen was awarded the INFORMS Prize for the Teaching of ORMS Practice.

I unfortunately did not know Dr. Jensen personally.  I was first introduced to his ORMM website through my graduate courses at SMU.  The ORMM website is a great resource for teaching the principles of Operations Research methods.  I was also able to use some of his Excel modeling add-ins in practice to demonstrate optimization problems.

Dr. James Cochran is going to hold a special session in memoriam of Dr. Jensen.  This message from Dr. Cochran was sent to Dr. Jensen's ORMM mailing list.

Dear friends and colleagues,

Paul was a good friend and colleague.  I know each of us will miss him (as will many other friends throughout the OR community) and each of us is very sorry for the loss suffered by Margaret and the rest of Paul's family.

I will chair a special INFORM-ED sponsored session in Paul's memory at the 2011 INFORMS Conference in Charlotte (November 13-16).  Several of Paul's many friends will speak on his contributions to operations research education and share personal stories and remembrances about Paul.  Margaret and Paul's children will be invited to attend, and I hope each of you will also be able to attend (I'll try to reserve some time at the end of the session during which members of the audience will have an opportunity to share their thoughts).

INFORMS Transactions on Education (the online journal for which I am Editor in Chief) will also publish a special issue devoted to Paul's influence on OR education.  Dave Morton has kindly agreed to edit this special issue, so I am certain it will be a fine tribute to Paul.

Sincerely,

Jim

Monday, May 16, 2011

Welcome to the Insight Age

We are in the midst of the Insight Age, in other words the end of the Information Age.  This has been explained on HPCwire, quoting HP Labs distinguished technologist Parthasarathy Ranganathan.  No longer are we seeking ways to process information.  We are seeking ways to disseminate and draw conclusions from the information we already hold.

Does this sound familiar to anyone in Operations Research?  It should because this is what Operations Research has been doing for years.  I think I sound like a broken record sometimes.  Yet I guess the story needs to be told again.  But perhaps I'm being a little too snarky.  It could just mean that the Information Age is catching up to the decision science analysts.

The crux of the article is technology meeting the demands of information overload.  Yet that is not what the definition of insight is to me.  Insight is drawing conclusions based on the evidence.  The Operations Research analyst will undoubtedly be well prepared for this evolutionary advancement.  I'm sure HP is aware that technology alone will not help the Insight Age revolution. 

I hope we've all seen this new age coming.  The Insight Age is here and is ready to be tackled.  My next inclination is to think what will define the Insight Age.  The Information Age was defined by the internet, computing power, and globalization.  My prognostication to define the Insight Age is open data and decision science.  Open data is about having no barriers to information.  Data will be freely accessible and easy to disseminate.  Decision science is already here and will make an even bigger impact.  Machine Learning, Artificial Intelligence, Optimization Algorithms will all be the cogs of the Insight Age mechanism. 

Insight Age is such a fitting name.  I'm really liking it the more I think about it.  I'm going to try to remember that in some of my future conversations.

Sunday, May 15, 2011

R Tutorial: Add confidence intervals to dotchart

Recently I was working on a data visualization project.  I wanted to visualize summary statistics by category of the data.  Specifically I wanted to see a simple dispersion of data with confidence intervals for each category of data. 

R is my tool of choice for data visualization.  My audience was a general one, so I didn't want to use boxplots or other density-type visualization methods.  I wanted a simple mean and a 95% (roughly 2 standard deviations) interval around the mean.  My method of choice was to use the dotchart function.  Yet that function is limited to showing the data points and not the dispersion of the data.  So I needed to layer in the confidence intervals.

The great thing about R is that plots can be built up in layers.  I can create one plot and add to it as I see fit.  This is true of most plotting functions in R.  I knew that I could use the lines function to add lines to an existing plot.  This method worked great for my simple plot and adds another tool to my R toolbox.

Here is the example dotchart with confidence intervals R script using the "mtcars" dataset that is provided with any R installation.


### Create data frame with mean and std dev
x <- data.frame(mean=tapply(mtcars$mpg, list(mtcars$cyl), mean),
                sd=tapply(mtcars$mpg, list(mtcars$cyl), sd))

###  Add lower and upper levels of confidence intervals
x$LL <- x$mean-2*x$sd
x$UL <- x$mean+2*x$sd

### plot dotchart with confidence intervals

title <- "MPG by Num. of Cylinders with 95% Confidence Intervals"

dotchart(x$mean, col="blue",
         xlim=c(floor(min(x$LL)/10)*10, ceiling(max(x$UL)/10)*10),
         main=title)

for (i in 1:nrow(x)){
    lines(x=c(x$LL[i],x$UL[i]), y=c(i,i))
}
grid()

And here is the example of the finished product.

Tuesday, May 3, 2011

Google funding research to measure regret

According to an article in Mashable, Google is funding Artificial Intelligence research at Tel Aviv University that will help determine if computers can be taught regret.  My first inclination is to wonder if this is really anything new.  Linear programming itself is all about regret or, in financial terms, opportunity cost.  The article describes the research as being about how to
measure the distance between a desired outcome and the actual outcome, which can be interpreted as “virtual regret.” 
That sounds a lot like mathematical programming to me.  So what is so different about the Tel Aviv University research?  Apparently it's not the algorithms that are new, but more or less how the data is processed.  Dr. Yishay Mansour explains that they will be using machine learning methodologies to look at all the relevant variables in advance of making informed decisions.  This sounds more like research in the realm of understanding large amounts of data and processing it into usable information.
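
For readers who have not seen "regret" used this way, here is a small toy sketch in R (my own illustration, not the Tel Aviv work): an epsilon-greedy strategy picks among three options with unknown payoff rates, and its cumulative regret is the gap between what the best option would have earned in hindsight and what the chosen options earn in expectation.

set.seed(42)
true_means <- c(0.3, 0.5, 0.7)     # hypothetical payoff rates of three options
n <- 1000
eps <- 0.1                         # exploration rate
est <- rep(0, 3)                   # running payoff estimates
cnt <- rep(0, 3)
choice <- integer(n)

for (t in 1:n) {
  a <- if (runif(1) < eps) sample(3, 1) else which.max(est)
  r <- rbinom(1, 1, true_means[a])          # observed payoff (0 or 1)
  cnt[a] <- cnt[a] + 1
  est[a] <- est[a] + (r - est[a]) / cnt[a]  # update running mean
  choice[t] <- a
}

# Cumulative "virtual regret" versus always playing the best option
regret <- cumsum(max(true_means) - true_means[choice])
tail(regret, 1)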

Big data is a huge problem in the data-rich but information-poor internet environment that we face today.  There is a lot of data handled by organizations, but they need to know what to do with it.  Today's Operations Research professional should be poised to swoop in and help with this issue.  Organizations are data rich but lack the focus to apply that data to meaningful decision analysis.  I'm hoping that this is going to lead to a big watershed moment for the Operations Research community.

Thursday, April 21, 2011

Open Source replacements for Operations Research and Analytics Software

I was reading an article from Datamation on 70 Open Source Replacements for Small Business when I noticed a glaring omission.  Where are the software applications for Operations Research and Analytics?  So here is my best addendum to the article, completing what small businesses should know about open source analytics productivity software.

Statistics and Computation

1.  R Project

Replaces: SAS, SPSS

R is a free and open source statistical computing environment that holds its own against some of the most established proprietary statistical environments.  R is available on all operating systems and is free for download.  R also has a community driven library of add-on packages that are also freely available and cover almost any statistical, mathematical, or optimization need.

Also a great reference manual for those switching from SAS to R is SAS and R: Data Management, Statistical Analysis, and Graphics


2.  RapidMiner

Replaces:  KnowledgeSEEKER

RapidMiner is data mining software with a graphical front end.  RapidMiner is suitable for most data mining and data transformation needs.


Mathematical Programming and Optimization

3.  GLPK

Replaces:  AMPL

GLPK is the GNU Linear Programming Kit, a free software package intended for large-scale linear programming and mixed integer programming.  Models are written in GNU MathProg (or GMPL), which is considered a subset of the AMPL syntax, and GLPK includes its own solver.
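
If you prefer to stay inside R, GLPK can also be called through an R interface.  The following is a minimal sketch that assumes the Rglpk package; the tiny mixed integer program is made up purely for illustration.

library(Rglpk)    # assumed installed; R interface to the GLPK solver

obj <- c(3, 1, 3)                 # maximize 3x + y + 3z
mat <- matrix(c(-1,  0,  1,
                 2,  4, -3,
                 1, -3,  2), nrow = 3, byrow = TRUE)
dir <- c("<=", "<=", "<=")
rhs <- c(4, 2, 3)
types <- c("I", "C", "I")         # x and z integer, y continuous

Rglpk_solve_LP(obj, mat, dir, rhs, types = types, max = TRUE)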

4.  Symphony

Replaces: CPLEX, Gurobi

Symphony is a mixed integer linear programming solver developed under COIN-OR.  Symphony is a flexible framework that offers many ways to customize solver capabilities for a given problem set.


5.  OpenSolver

Replaces:  Excel Solver

OpenSolver is a linear and integer optimization alternative to the Solver add-in in Microsoft Excel.  OpenSolver is based on the COIN-OR CBC engine.  Unlike the Excel Solver, there are no software limits on the size of the problem that can be solved.

Tuesday, April 12, 2011

O'Reilly - Quiet Rise of Machine Learning

An interesting article from the O'Reilly Radar blog by Jenn Webb covers the quiet rise of Machine Learning.  I love the insight from the article that decision sciences like Operations Research are becoming more mainstream.  Machine Learning is the science that mixes data mining and predictive analytics.  The methodologies of machine learning are nothing new.  Computers have caught up with the mathematics, and now skilled scientists are using these techniques all over industry.

Yet what is interesting to me in this article is that it implies machine learning is rising basically from nothing, as if it were some sort of newfangled technology developed by IBM for a special man vs. machine Jeopardy act.  I guess I'm just close enough to the Operations Research community to know where the roots really lie.  On one hand I'm happy that decision sciences like machine learning are getting more and more recognition.  On the other hand I'm thinking "Where have you been since World War II?".  I guess I'm a little too cynical lately.

I love the O'Reilly Radar blog, as it seems more and more of its articles are about the promise of data analytics.  I guess I'm just wishing for a little more investigative reporting.  In fact, I think it would benefit INFORMS to partner with O'Reilly Media.  O'Reilly definitely has a focus on analytics now, and INFORMS is primed to provide a lot of great content for discussion.

Wednesday, March 30, 2011

Baseball and Decision Analytics

When it's springtime, it means that baseball season is getting ready to start.  The 2011 Major League Baseball season gets going on the last day of March.  Baseball is as American as apple pie, and almost every baseball enthusiast has something to say about the game.  Analytics professionals are not far behind when it comes to opinions on baseball.

Baseball is definitely a numbers game.  Mathematicians have been studying baseball for as long as the game itself has been played.  One of the first notable baseball analysts to apply decision analysis was Bill James.  Bill coined the term sabermetrics for the study of baseball analysis, taken from the acronym of the Society for American Baseball Research (SABR).  More recently, baseball decision analysis has found its way into Major League Baseball teams' management offices.  Popular books such as Moneyball by Michael Lewis and The Extra 2%: How Wall Street Strategies Took a Major League Baseball Team from Worst to First by Jonah Keri have shown how major league management turned poor-performing clubs into championship contenders.  The mathematics behind their decision analysis is best described in Wayne Winston's book Mathletics: How Gamblers, Managers, and Sports Enthusiasts Use Mathematics in Baseball, Basketball, and Football.

Baseball decision analysis has grown up a lot since the early days of Bill James.  Now baseball decision analysis uses techniques such as replacement value.  Value Over Replacement determines the value of a player relative to a run-of-the-mill replacement at that player's position.  Value Over Replacement was made popular by Keith Woolner, an author of Baseball Prospectus 2011.  At first the value, which is usually offensive value, was meant to determine how many runs a player could produce over such a replacement player.  Now value-over-replacement methodologies determine how many wins a player can generate for their team.  One of the best sites for WAR analysis, or Wins Above Replacement, is FanGraphs.  FanGraphs has about every major baseball statistic available for the baseball enthusiast.  In fact, they even have heat maps for pitch location.  Ready to manage your own team yet?

Pitch location heat map from Fangraphs.com


Of course all of this decision analysis would not be possible without the numbers.  One of the best places for baseball data is Baseball-Reference.com.  Just about every data point on baseball can be mined from the site and downloaded.  So if you have a craving to create your own baseball metric or analytics strategy there should be nothing stopping you.
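
For example, here is a minimal sketch of computing a simple metric in R.  It assumes the Lahman package, which bundles the Lahman baseball database (Baseball-Reference and similar sites build on the same kind of data); the season and at-bat cutoff are arbitrary choices.

library(Lahman)                   # assumed installed; provides the Batting table

# Batting average for regulars in the 2010 season (400 or more at-bats)
bat <- subset(Batting, yearID == 2010 & AB >= 400)
bat$AVG <- round(bat$H / bat$AB, 3)

# Top of the leaderboard by our home-made metric
head(bat[order(-bat$AVG), c("playerID", "teamID", "AB", "H", "AVG")])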

This is another post in the INFORMS Online Blog Challenge.  This month is O.R. and Sports. 

Tuesday, March 22, 2011

R again in Google Summer of Code

I'm a big fan of the Google Summer of Code.  It brings great projects together with a learning opportunity for students.  Once again the R Project was selected to be part of the Google Summer of Code in 2011.  Some other notable mathematical and statistics projects alongside R include Shogun Machine Learning, SymPy, Gambit, Computational Geometry Algorithms Lab, Orange, and Computational Science and Engineering.

The Google Summer of Code has really grown over the years.  I'm glad to see that these open source initiatives really help teach our younger generation. 

Wednesday, March 16, 2011

OpenOpt Suite release 0.33

New release 0.33 of OpenOpt Suite is out:

OpenOpt:

  • cplex has been connected
  • New global solver interalg with guaranteed precision, a competitor to LGO, BARON, MATLAB's intsolver and Direct (it can also work in inexact mode); it can handle non-Lipschitz and even some discontinuous functions
  • New solver amsg2p for unconstrained medium-scaled NLP and NSP

FuncDesigner:

  • Substantial speedup for automatic differentiation when vector variables are involved, in both dense and sparse cases
  • Solving MINLP is now available
  • Added uncertainty analysis
  • Added interval analysis
  • Now you can solve systems of equations with automatic determination of whether the system is linear or nonlinear (subject to the given set of free or fixed variables)
  • FD Funcs min and max can work on lists of oofuns
  • Bugfix for sparse SLE (systems of linear equations) that had slowed down computations and demanded more memory
  • New oofuns angle, cross
  • Using OpenOpt result(oovars) is now available; also, start points with oovars() can now be assigned more easily

SpaceFuncs (a 2D, 3D, and N-dimensional geometry package with capabilities for parametrized calculations, solving systems of geometric equations, and numerical optimization with automatic differentiation):

  • Some bugfixes

DerApproximator:

  • Adjusted with some changes in FuncDesigner

For more details visit http://openopt.org.

Saturday, February 19, 2011

IBM has a Natural Language Purpose

I wanted to write a blog post about the advancements in Natural Language Processing in light of the performance of IBM's Watson on the Jeopardy challenge last week.  Natural Language Processing is the science of transforming and interpreting human spoken and written language by artificial means.  Generally this type of study has been limited to academic research due to its high computing power demands.  Now there are even open source software implementations, including many R Natural Language Processing packages.  There is a lot to write about the new advances in NLP.
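
As a small taste of what those open source tools look like, here is a minimal sketch in R.  It assumes the tm text mining package (one of several NLP-related packages on CRAN) and simply builds a term-document matrix from two toy sentences.

library(tm)    # assumed installed; provides Corpus and TermDocumentMatrix

docs <- Corpus(VectorSource(c("Watson parses natural language questions",
                              "Operations research improves decision making")))

# Basic clean-up before counting terms
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)

# Rows are terms, columns are documents, cells are counts
tdm <- TermDocumentMatrix(docs)
inspect(tdm)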

Instead I came across an interesting editorial on the cheap publicity stunt that is IBM's Watson.  At first I thought the article was a comedy that would make fun of Watson's errors on Jeopardy.  Then I realized the author, Colby Cosh, is not jesting at all.  This should not be news to me.  The field of Operations Research, which was definitely used to help develop Watson, is a widely misunderstood field.  Cosh has a hard time understanding why IBM would want to develop such a stunt to compete against humans.  Cosh seems to think that the only gain is for IBM's shareholders.  I can assure you that if IBM wanted to make money on this venture they would have created a computer that would compete on American Idol.  Jeopardy is no ratings juggernaut in the US.

So what purpose would IBM have for competing on Jeopardy?  Perhaps the idea of "competition" is misleading.  In my eyes, I was not watching to see if a computer could beat humans in a battle of wits.  I was watching to see if a device could interpret, process, and return meaningful information on the same level as human interpretation.  Natural Language Processing is like code breaking.  Similarly, mathematics, physics, and the natural sciences are like codes to mathematicians, scientists, and engineers.  It is the process of trying to decipher and interpret our natural surroundings.  Language is no different.  I can see how it would be easy for Cosh to think that the sole idea of the competition was to beat humans.  The purpose was simply to decipher the natural language code.  With a better understanding of natural language, we can then understand our surroundings a little better.

So why the hype with a computer?
"So why, one might ask, are we still throwing computer power at such tightly delimited tasks,..."
The answer can already be found in the field of Operations Research and Management Science.  Perhaps Cosh has purchased a plane ticket in the past few years.  He might have noticed that air transportation has become very affordable due to competitive pricing.  A lot of that is due to optimization and revenue management algorithms in the airline industry.  Perhaps he noticed the increase in quality, service, and price of privatized parcel postage.  The science of better decision making and transportation algorithms has greatly improved supply chain and delivery efficiency.  The list can go on and on.  Artificial Intelligence is probably a poor way of describing computer optimization and machine learning science.  Artificial Intelligence is not going to replace human intelligence but only help improve the human-based decisions that we make every day.  IBM has already stated that they wish to improve the medical field with Watson.  Medical diagnosis requires vast amounts of information, and Watson can help decipher medical journals, texts, and resources within seconds.  Applications of Watson could be used in third-world countries where medical resources are scarce.

I will be looking forward to IBM's advancement with Natural Language Processing.  This offers a new venture into better decision sciences.  Perhaps "smacking into the limits" of artificial intelligence will create a better life for those that use human intelligence every day.

Wednesday, February 16, 2011

Question and Answer sites for Analytics and Operations Research

This post is inspired by a similar post on Jeromy Anglim's blog about statistics question and answer sites.  I thought it would be a good idea to list some of the Operations Research and Analytics focused question and answer sites.  Some of these are listed below.


Operations Research
OR-Exchange
This is a Stack Exchange-style Q&A site started by Michael Trick.  It is a crowd-sourced question and answer resource for anything Operations Research related, and I believe the best one available for Operations Research.

Numerical Optimization Forum
I find it unfortunate that there are so few forums devoted to Operations Research.  This is one of the few, and it is a good one.  It is moderated by IEOR Tools contributor Dmitrey.

I'm purposely leaving out the sci.ops-research Usenet group because I believe it's fallen into disarray with spam content.

Math/Statistics
Cross Validated
My favorite stack-exchange site dedicated to statistics.

Math   
Math Overflow  

Software
StackOverflow - R tag    
StackOverflow - SQL tag

Mailing Lists
Mailing lists do not get as much attention as they once did, maybe because there are so many other options on the internet for getting information.  I still think they are a valuable resource and a good online community.

R Help Mailing List   http://www.r-project.org/mail.html
GLPK Mailing List   http://lists.gnu.org/mailman/listinfo/help-glpk
COIN-OR Mailing List(s)  http://list.coin-or.org/mailman/listinfo/


Beta StackExchange Sites
These sites might be of interest to the Operations Research community.  They are not live yet but are looking to generate a following.

http://area51.stackexchange.com/proposals/28815/computational-science
http://area51.stackexchange.com/proposals/1907/numerical-modeling-and-simulation
http://area51.stackexchange.com/proposals/27706/engineering-and-applied-sciences
http://area51.stackexchange.com/proposals/26434/machine-learning
http://area51.stackexchange.com/proposals/24602/data-capture-analysis
http://area51.stackexchange.com/proposals/22964/sas-programming-language
http://area51.stackexchange.com/proposals/18584/engineering-and-scientific-software-tools
http://area51.stackexchange.com/proposals/15237/r-statistical-package
http://area51.stackexchange.com/proposals/9218/operations-research (Interesting.  Do they know about OR-Exchange?)

I would love to see more examples that I can include in this list.

UPDATE:

I forgot to add O'Reilly's Q&A site with the R tag.  http://answers.oreilly.com/tag/R

Friday, February 11, 2011

Science of Matchmaking

The science of matchmaking has seen serious growth in the last few years.  What exactly is so scientific about matchmaking anyway?  The goal of any commercial enterprise (and some public organizations) is to match products or services to the demand of consumers.  The idea of matching consumers with products and services is not new; matchmaking is essentially the business art of Marketing.  It is the science behind the matchmaking that has seen the most advancement and improvement in recent times.  Generally speaking, computing power has made the difference for this technological leap forward.  Millions of observations and data points can be sifted and combed with great ease compared to even just a decade ago.  There is a scientific magic fiddling to the matchmaking phenomenon (sorry, bad Fiddler on the Roof pun).


Mathematics of Matchmaking

I'm not sure I can cover all of the math behind the science of matchmaking, so I thought it best to describe an example with the company Netflix.  Netflix wants to make the decision process of selecting movies easier for its customers.  Netflix developed an algorithm to match customers' interests in movies.  In fact, they even decided to farm out an improvement to the algorithm in a worldwide contest.  So how does the Netflix algorithm work?  There is a lot of math behind the algorithm, but it essentially comes down to finding common features in the customer and movie data.  The customers give Netflix a clue to the features they want by ranking movies they enjoy.  These rankings then become the dependent variables in the algorithm formulation, and the algorithms churn out likely matches based on common feature sets.

Perhaps one of the best writings on this subject was given by Simon Funk on his blog about his Netflix Contest adventures.  Simon thought a creative way to find features would be to use the matrix factorization technique of Singular Value Decomposition.  Traditionally, Singular Value Decomposition was used in the microelectronics industry to improve digital signal processing.  Simon wrote up an easy solution for matchmaking movie features with the SVD method, which spurred a wave of enthusiasm among the Netflix Contest entrants.
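
To make the idea concrete, here is a toy sketch in R using base R's svd() on a tiny, fully observed, made-up ratings matrix.  Keeping only the top two singular values gives a low-rank "feature" reconstruction of the ratings.  Note that Funk's actual approach learned the factors by incremental gradient descent so it could cope with the mostly missing entries of the real Netflix data.

# Hypothetical ratings: rows are users, columns are movies
R <- matrix(c(5, 4, 1, 1,
              4, 5, 2, 1,
              1, 1, 5, 4,
              2, 1, 4, 5), nrow = 4, byrow = TRUE,
            dimnames = list(paste0("user", 1:4), paste0("movie", 1:4)))

s <- svd(R)
k <- 2                            # keep two latent "features"

# Low-rank reconstruction: each rating approximated from the two features
R_hat <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
dimnames(R_hat) <- dimnames(R)
round(R_hat, 2)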


Finding feature sets is not exclusively the realm of linear algebra.  There are also methods such as clustering, regression, support vector machines, neural networks, Bayesian networks, and decision trees, just to name a few.  The science of matchmaking is closely related to artificial intelligence and is commonly referred to as machine learning.  Machine learning uses algorithms and mathematical methods to evolve and generate behaviors from data in order to solve a problem.


Processing the Matchmaking Data

The science of matchmaking would not be complete without the data.  The advent of the internet has opened up a lot of new enterprises that make use of millions of data observations.  These internet companies have a lot of data to process, in huge server arrays that would make even the ENIAC envious.  So how do these companies process all of this data with their matchmaking algorithms?  The basic answer is to break it down into manageable chunks.  Perhaps there is no greater example than Google and their MapReduce methods.  MapReduce is a software framework that takes a large computing job and breaks it down across a distributed network so that it is more manageable.  The first step in the MapReduce process is to Map the data: the Map step organizes and distributes the data to computing nodes, usually in a huge cluster.  The Reduce step applies the algorithm or learning process on each node and determines an answer for the data it is given, essentially a local result.  The partial results are then combined, and the process can be iterated until a globally learned answer is achieved.  This is a very cut-and-dried description, but you get the idea.
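
Here is a toy sketch of the map-and-reduce pattern using nothing but base R's Map and Reduce functions (an illustration of the idea only, not Google's MapReduce or Hadoop): each "node" counts the words in its own document, and the partial counts are then merged into global totals.

docs <- list("the quick brown fox", "the lazy dog", "the fox and the dog")

# Map step: each document is turned into its own (word, count) table
mapped <- Map(function(doc) table(strsplit(doc, " ")[[1]]), docs)

# Reduce step: merge two partial counts into one
merge_counts <- function(a, b) {
  words <- union(names(a), names(b))
  sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
}

Reduce(merge_counts, mapped)      # global word counts across all documents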


Google's MapReduce implementation is proprietary, but that has not stopped software enthusiasts.  An open source MapReduce implementation called Hadoop was created and is growing a strong, user-supported community.


So where can the science of matchmaking be used?  Really anywhere the heart desires (okay, again, that was bad).  Amazon.com uses recommendation algorithms for its books and products.  Online dating sites (how appropriate) use matchmaking methods to match interested daters.  Search engines like Google use ranking algorithms, such as PageRank, to match search keywords with websites.  As you can tell, these types of enterprises are doing very well thanks to the science of matchmaking.
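
Since PageRank comes up so often, here is a toy power-iteration sketch of the idea in R on a made-up four-page link graph (it illustrates the ranking principle only, not Google's production system).

# Adjacency matrix of a hypothetical web graph: A[i, j] = 1 if page i links to j
A <- matrix(c(0, 1, 1, 0,
              0, 0, 1, 0,
              1, 0, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)

M <- t(A / rowSums(A))            # column-stochastic transition matrix
d <- 0.85                         # damping factor
n <- nrow(A)
r <- rep(1 / n, n)                # start from a uniform rank vector

for (i in 1:50) r <- (1 - d) / n + d * M %*% r

round(as.vector(r / sum(r)), 3)   # approximate PageRank of the four pages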



This article is part of the INFORMS Online blog challenge.  February's blog challenge is Operations Research and Love.

Wednesday, February 9, 2011

Data Mining Books List

I came across a great list of Data Mining books while perusing the internet.  The list is maintained by Kurt Thearling, who has served as Director and Chief Scientist at various organizations, helping to develop their Analytics and Engineering groups.  Kurt has written some white papers on the subject of Data Mining and has also been featured on NPR.  Kurt's NPR piece was about data mining and privacy, which is obviously a big subject in our Facebook society today.

I believe this is probably one of the most comprehensive lists of Data Mining books available.  If you are interested in obtaining one of these books, please be sure to peruse the new IEOR Tools Online Store Data Mining section.  There you can find books and references on Data Mining at varying levels, from introductory to advanced applications.

Thursday, January 27, 2011

Operations Research and gerrymandering

The INFORMS blog challenge for January involves Operations Research and politics.  While public debate on politics will never change, perhaps one thing that can change is the involvement of better decision making in resolving some of the political discourse.

One area of politics is the topic of gerrymandering.  For those who are not familiar with gerrymandering, it is the process of redrawing electoral boundaries for voting purposes.  Gerrymandering is a hot topic around any election because it is usually the party in power that controls the right to reset the electoral boundaries.  This leads to an obvious advantage for that party, as it can maintain seats in a legislature by setting boundaries based on past voting behavior.

Operations Research can be a valuable asset to the process of redistricting.  In fact, Operations Research has been very much involved in redistricting for at least 50 years.  Decisions to draw electoral lines can follow any number of criteria, including demographics, population centers, municipality boundaries, or industry types.  As information flows more freely, there is more opportunity to use it for decision making.  It seems every new census brings more available data, and the growth of the internet has allowed information to be available more openly.  Opportunities should grow for Operations Research to provide redistricting decision makers with the information for informed analysis.

Perhaps one of the better uses of Operations Research could be in the ethical context of the gerrymandering debate.  I have often heard it argued that Operations Research may have helped create the politically polarized country we have today.  The same tools of Operations Research could be used to bring transparency to the redistricting process.  It could be useful for citizens to know how probable various election outcomes are under proposed redistricting plans.  Websites like OpenSecrets.org show how money influences party affiliation and elections.  Perhaps similar websites can emerge on electoral districts and the legislation that helped create them.

The debate over gerrymandering will last for centuries, I am sure.  I believe Operations Research can play a vital part in that debate.  Information is more open and easier to access than ever.  Let's use it to the best of our ability and help inform the voting electorate.