- Open Source Replacements for Operations Research and Analytics Software
- R Tutorial: Add confidence intervals to a dot chart
- Science of Matchmaking
- Data Mining Books list
- Physicist cuts airplane boarding time in half
- R again in Google Summer of Code
- Moneyball coming to the big screen
- Baseball and Decision Analytics
- Question and Answer sites for Analytics and Operations Research
- Sports analytics summer blog reading recommendations
Friday, December 30, 2011
Most popular 2011 IEOR Tools blog articles
Wednesday, December 21, 2011
Visualizing categorical data in R
The following picture is the result of the logodds function in R. The chart is close to the original SAS macro's output but not an exact reproduction. For the histogram points I decided to use the default squares of the stripchart plot, in a grey color to make them look a little faded.
The following is the R script.
logoddsFnc <- function(data_ind, data_dep, ind_varname, min.count=1){
  # Assumptions: x & y are numeric vectors of the same
  # length, y is a 0/1 variable. This returns a vector
  # of breaks of the x variable where each bin has at
  # least min.count number of y's
  bin.by.other.count <- function(x, other, min.cnt=1) {
    csum <- cumsum(tapply(other, x, sum))
    breaks <- numeric(0)
    i <- 1
    breaks[i] <- as.numeric(names(csum)[1])
    cursum <- csum[1]
    for ( a in names(csum) ) {
      if ( csum[a] - cursum >= min.cnt ) {
        i <- i + 1
        breaks[i] <- as.numeric(a)
        cursum <- csum[a]
      }
    }
    breaks
  }
  brks <- bin.by.other.count(data_ind, data_dep, min.cnt=min.count)
  # Visualizing binary categorical data
  var_cut <- cut(data_ind, breaks=brks, include.lowest=T)
  var_mean <- tapply(data_dep, var_cut, mean)
  var_median <- tapply(data_ind, var_cut, median)
  mydf <- data.frame(ind=data_ind, dep=data_dep)
  fit <- glm(dep ~ ind, data=mydf, family=binomial())
  pred <- predict(fit, data.frame(ind=min(data_ind):max(data_ind)),
                  type="response", se.fit=T)
  # Plot
  plot(x=var_median, y=var_mean, ylim=c(0,1.15),
       xlab=ind_varname, ylab="Exp Prob", pch=21, bg="black")
  stripchart(data_ind[data_dep==0], method="stack",
             at=0, add=T, col="grey")
  stripchart(data_ind[data_dep==1], method="stack",
             at=1, add=T, col="grey")
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit, col="blue", lwd=2)
  lines(lowess(x=var_median,
               y=var_mean, f=.30), col="red")
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit - 1.96*pred$se.fit, lty=2, col="blue")
  lines(x=min(data_ind):max(data_ind),
        y=pred$fit + 1.96*pred$se.fit, lty=2, col="blue")
}
logoddsFnc(icu$age, icu$died, "age", min.count=3)
The ICU data for this example can be found in the R package "vcdExtra". Special thanks to David of Univ. of Dallas for providing me with a way to develop breaks in the independent variable, as seen in the bin.by.other.count function.
The author of the original SAS macro, Michael Friendly, also wrote Visualizing Categorical Data, which is a great reference for analyzing and visualizing data in factored groups.
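If you want to reproduce the chart, here is a minimal sketch of loading the data. I'm assuming the data set in vcdExtra is named ICU with age and died columns; if died is stored as a factor in your version, recode it to 0/1 first as shown.
library(vcdExtra)   # provides the ICU data set (assumption: package installed)
data(ICU)
icu <- ICU
# if died is stored as a "No"/"Yes" factor, recode it to 0/1 (skip if already numeric)
if (is.factor(icu$died)) icu$died <- as.numeric(icu$died == "Yes")
logoddsFnc(icu$age, icu$died, "age", min.count=3)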
Thursday, December 15, 2011
OpenOpt Suite 0.37
Hi all,
I'm glad to inform you about new release 0.37 (2011-Dec-15) of our free software:
OpenOpt (numerical optimization):
- IPOPT initialization time gap (time till first iteration) for FuncDesigner models has been decreased
- Some improvements and bugfixes for interalg, especially for "search all SNLE solutions" mode (Systems of Non Linear Equations)
- Eigenvalue problems (EIG) (in both OpenOpt and FuncDesigner)
- Equality constraints for GLP (global) solver de
- Some changes for goldenSection ftol stop criterion
FuncDesigner:
- Major sparse automatic differentiation improvements for badly vectorized or unvectorized problems with lots of constraints (except for box bounds); some problems now run many times, or even orders of magnitude, faster (though of course not faster than properly vectorized problems with a small number of variable arrays). It is recommended to retest your large-scale problems with useSparse = 'auto' | True | False
- Two new methods for splines to check their quality: plot and residual
- Solving ODE dy/dt = f(t) with specifiable accuracy by interalg
- Speedup for solving 1-dimensional IP by interalg
SpaceFuncs and DerApproximator:
- Some code cleanup
You can follow OpenOpt development news through our recently created Twitter and Facebook accounts; see http://openopt.org for details.
See also: FuturePlans, this release announcement in OpenOpt forum
Regards, Dmitrey.
Expanded list of online courses for data analysis
Some of the interesting courses are
- Model Thinking
- Natural Language Processing
- Game Theory
- Design and Analysis of Algorithms
- Introduction to Linear Dynamical Systems
- Convex Optimization I
- Convex Optimization II
As an analytics professional of many years, I've found that honing your skills is very important for your career. Now more than ever it is easy to do, with schools opening up their classes to everyone. I strongly recommend finding areas of expertise that you are passionate about or want to learn, and finding the schools that offer them online.
Thursday, November 10, 2011
My learning as a Data Scientist
A recent blog post by Kontagent Kaleidoscope about Big Data is Useless without Science got me thinking about my role as a self-proclaimed Data Scientist. The blog article points out a need for the science of better decision making. Organizations are looking for people to help them turn their data mines into information gold. I've definitely learned a lot over the years as a Data Scientist and I thought I would list some of those learnings.
1. Organizations Don't Know What a Data Scientist Can Do
The idea here is marketing your own talents. The Data Scientist needs to put their methods and work out there for the organization to see and touch. This means working with peers and management across the organization. The Data Scientist needs to be able to eloquently relate methods, problems, and challenges, and how they can be solved. Important skills here are personal marketing and communication. I know this goes against the grain for many numbers geeks like me.
2. Problems Don't Solve Themselves
Opportunities for solving real problems in an organization are always around. The trick is being in the right place at the right time to be able to solve those problems. Organizations have hoarded a lot of data and many times they don't even remember why. The Data Scientist needs to turn into a Data Detective. Explore all aspects of the organization. Interview different departments and see how they tick and ask questions like "What keeps you up at night about your job?". I was often surprised how a simple solution would go a long way to helping someone else out. This develops true collaboration and leads to bigger problems to solve.
3. Always Continue to Learn New Things
The world is constantly evolving and there are always new tools, tricks, methods, algorithms, software and mechanisms. The Data Scientist needs to be able to adapt to new technologies. I've found it's best to stay current with what's new in order to stay sharp and meet new demands. The internet can be your friend; even keeping up with a favorite list of blogs can help with staying current. Times change and so do organizations' needs. Perhaps this is just me, but I love learning new things as it creates a fun diversion and improves my skill set.
Tuesday, October 11, 2011
Top 50 Statistics blogs of 2011
Monday, October 3, 2011
Data mining the Federal Reserve
Monday, September 26, 2011
Machine Learning for everyone
Per Stanford's website, Machine Learning covers data mining and statistical pattern recognition. Mostly it is about applying mathematical and statistical methods to draw behaviors and information out of data sources. So do you want to invent the next Netflix, Amazon or Google? This is the course for you.
If you do not want to enroll in the Machine Learning class you could always watch some of the older lectures online. Andrew Ng provides plenty of information from past lectures along with student-contributed projects. The CS 229 website is worth a look for a bunch of Machine Learning related resources.
Friday, September 23, 2011
Data Driven Success in Professional Baseball
I really liked this quote from Paul in the article.
We didn’t solve baseball. But we reduced the inefficiency of our decision making.
Is that not the sort of thing that an analytical professional or an Operations Researcher ultimately tries to do? Operations Research is not the art of creating anything new. It is the art of making existing things better. All decision making is inefficient to some point. Even the right decision can be inefficient on some level. Decisions are full of balancing acts between constraints and feasibility.
Also, this proves that no industry or organization is without a need for efficient decision making. Even baseball can use a dose of improved decision analysis, whether it is scheduling the league or determining which pitcher gives the best value. Sports have definitely come into their own with decision analytics. I'm eager to watch Paul's career and wonder if analytics is taking it to the next level.
Thursday, September 15, 2011
OpenOpt Suite 0.36
OpenOpt:
* Now solver interalg can handle all types of constraints and integration problems
* Some minor improvements and code cleanup
FuncDesigner:
* Interval analysis now can involve min, max and 1-d monotone splines R -> R of 1st and 3rd order
* Some bugfixes and improvements
SpaceFuncs:
* Some minor changes
DerApproximator:
- Some improvements for obtaining derivatives at points in R^n where the left or right derivative for a variable is absent, especially for stencil > 1
See http://openopt.org for more details.
Wednesday, August 31, 2011
Physicist cuts airplane boarding time in half
Yet a physicist from Fermilab, Jason Steffen, did have some interesting ideas to improve the existing airplane boarding procedures. By using Monte Carlo simulations to measure efficiency and test his ideas, he was able to cut airplane boarding time by as much as half. From the article, his method is to board window seats first, in alternating rows, so passengers do not interfere with each other.
This is a very clever idea. Yet I found one flaw that may not have been accounted for in his study. I've noticed that overhead space is at a premium for passengers, especially for business travelers. Business travelers often bring two carry-on bags. These bags tend to fill up the overhead bins rather quickly. When the overhead bins fill up, passengers have to search the aisles looking for available space for their bags. This creates a bottleneck, and queues develop for the other boarding passengers. It seems to me that Jason's study assumes that all overhead bins would be available at the time of boarding. If alternating rows are used in his model, then overhead bins might become filled to capacity before some passengers board, creating more bottlenecks. It's just one theory that would be worth investigating before Jason's procedures are implemented.
I applaud Dr. Steffen's studies and findings on the airplane boarding problem. It is a fascinating problem, as most of us have encountered airplane boarding from time to time. For more information you can read about Jason's fascinating work on airplane boarding on his website.
Monday, July 25, 2011
What did we learn from the Space Shuttle program
Discover published an article this week on what it calls the debacle of the Space Shuttle program. A lot of good and interesting points are made by Amos Zeeberg in this article. The Space Shuttle was originally designed to be a cost-effective way of getting people and technology into space. The program definitely did not deliver on that promise or projection. Also, the Space Shuttle was considered to have a risk of failure of only 1 in 100,000. I don't know if that is remotely true. As we unfortunately know, the observed failure rate was 2 in 135. Space travel is risky no matter how it is done.
So from an engineer's point of view, albeit one that was not involved with the space program, what can we really learn from the Shuttle Program? I believe that by applying Industrial Engineering and Operations Research principles we could come to some conclusions. I don't personally think the Shuttle missions were a total debacle. As an Engineer there is always something to learn, even from a failure. Edison said it best: he didn't fail 1,000 times trying to develop a light bulb; he learned 1,000 ways not to build one.
Firstly, risk needs to be measured from both a micro and a macro perspective. There are many systems that can lead to failure, and each system has a life of its own. The risk could be as simple as an O-ring or as complicated as a practical study of landing on the Moon. Risks can be measured and weighed from the different perspectives of time, cost, and quality of delivering on a promise. When all risks are measured, then perspective can be put into place as to the delivery of that promise. Perhaps the Shuttle program didn't deliver on all its promises. Yet it did prove many things, and it showed that reusable vehicles were ahead of their time. We can learn a lot from the Shuttle Program about examining the risks of a promise and making sure that we evaluate different objectives and goals.
Secondly, engineering and management should be a cultivated relationship in which each understands the other's strengths and goals. Engineering has the design in its best interest. Management has the mission in its best interest. The design and mission are unique and have their own sets of goals. Yes, there are going to be risks weighed in both the design and the mission. The complexity comes when merging the risks of the design and the mission together. The magnitude of the NASA Space Shuttle Program magnified the relationship between engineering and management. The best and the worst were brought to light. The engineering marvel of creating a reusable vehicle is magnificent. The managerial feat of sending people into space with a reusable vehicle on more than 100 missions is not insignificant. The importance of merging design and mission together was a great learning experience from the Space Shuttle program. We have already seen the fruits of that success: missions to Mars and beyond the Solar System have proven it.
The NASA Shuttle Program was not an outright debacle. There was a lot to learn from the process. No it did not deliver on all initial expectations. Yet it did deliver on this young boy's dreams of discovery and knowledge. Once an Engineer, always an Engineer. I hope that we will never cease to learn and improve from our failures.
Monday, July 4, 2011
Problems with data visualizations followup
Now that we are in the Insight Age it seems that we will continually question and interpret how data is presented to us. We are now data rich but knowledge poor. I believe there are going to be vast new opportunities to help disseminate the data, and perhaps even new ways to help visualize it as well.
I strongly suggest reading Stephen Few's blog. It is an interesting read on how data visualization can be used poorly. He even shows examples of how to do it correctly.
Thursday, June 30, 2011
How not to do data visualization
Seems innocent enough. It shows, in declining order, the debt per nation. Wait a second? Why does Ireland have more debt than the USA? After reading the article more thoroughly it looks like it is a percentage of GDP. Wait, a third time? Is the bar graph the percentage of GDP, or is the number in the white box a percentage of GDP? And how does this relate to debt management? So apparently the article explains that the chart shows the change in primary balance needed for each nation to get to 60% of GDP. So the bar graph is a percentage change of GDP required to get to 60% of GDP. Are we crystal? I'm not sure I totally understand, but that is my basic understanding.
Data visualization is important in Analytics and Operations Research. We need to model real world applications quite a lot, and often there is no better way to do this than to use a chart or graph. The real art is conveying the crux of the message to the recipient. There is an internet meme devoted to the art of bad chart making. I feel bad using the Economist as an example because, after all, I did finally (I think) come away with the right idea. But still, notice how there are no data or axis labels across the top of the chart. Also the numbers in the white boxes are not given any units. I'm still not sure if those numbers in the white boxes are a percentage or a debt value. Sometimes the visual art clutters the real message. It is important to make sure that the recipient has the right frame of reference and can understand each graphic and label.
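To make the point concrete, here is a tiny R sketch of the kind of labeling that removes the guesswork: a title, axis labels with units, and the values printed on the bars. The numbers are made up for illustration, not the Economist's data.
# made-up illustrative figures, not real debt data
debt <- c("Country A"=120, "Country B"=95, "Country C"=60)
bp <- barplot(debt, main="Required change in primary balance",
              xlab="Country", ylab="Percent of GDP", ylim=c(0, 140))
text(bp, debt, labels=paste(debt, "%", sep=""), pos=3)   # show each value with its unit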
Sunday, June 26, 2011
Recommended Machine Learning blogs
Good Machine Learning blogs
Machine Learning is the scientific process of developing algorithms for computers to evolve based on empirical data. For instance one may develop a decision tree that helps predict a certain behavior from a data set. The decision tree itself is just a method to predict behavior. Yet perhaps more data can be acquired and more behaviors can be realized. Then the decision tree is computed again based on the newer data (and perhaps combined with the older). New behaviors are learned from the newer data and a new implementation of the decision tree is evolved for new behaviors. This process becomes algorithmic and continues.
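As a toy illustration of that retrain-as-data-arrives loop, here is a sketch of my own using simulated data and the rpart package (not any particular production system): fit a decision tree on an initial batch, then refit it once more observations come in.
library(rpart)
set.seed(1)
make_batch <- function(n) {   # simulated observations, a stand-in for real behavioral data
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  d$buys <- factor(ifelse(d$x1 + 0.5*d$x2 + rnorm(n, sd=0.5) > 0, "yes", "no"))
  d
}
old_data <- make_batch(200)
fit1 <- rpart(buys ~ x1 + x2, data=old_data)                   # tree learned from the first batch
new_data <- make_batch(200)                                    # more data arrives later
fit2 <- rpart(buys ~ x1 + x2, data=rbind(old_data, new_data))  # refit on old + new data
predict(fit2, new_data[1:3, ], type="class")                   # the "evolved" tree in action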
Machine Learning developed out of the field of Artificial Intelligence. The idea of having computers learn has been around for about as long as computers themselves. Machine Learning is really starting to develop now that computing power has caught up to the theory. Machine Learning has a lot of uses and may be used by some of your favorite computer applications. Some examples include product recommendation systems like Amazon or Netflix, and search engines like Google or Bing. Machine Learning is seeing practical uses in many places and it's only just scratching the surface.
Monday, June 20, 2011
Moneyball coming to the big screen
The story centers around Billy Beane, who is played by Brad Pitt in the movie. Billy Beane is a professional ballplayer turned General Manager. Billy Beane inherits the top organizational management job for the losing Oakland Athletics. He is immediately frustrated with the same old losing ways and believes he needs to shake up the system. He finds out about the curious world of baseball analytics, otherwise known as sabermetrics, and hires a curious crew of young, mathematically gifted folks.
The story is fascinating even if you are not a fan of baseball. The use of mathematics to help make business decisions is nothing new. Yet employing this analytics method to an industry that is deep rooted in old ways and practices is intriguing. Changing the ways of the "good ole boy" network requires risk, knowledge, and sometimes good fortune. This can translate to almost any industry or even organization. I am most definitely looking forward to seeing this movie.
Thursday, June 16, 2011
OpenOpt Suite 0.34
I'm glad to inform you about the new quarterly release 0.34 of our free OOSuite package software (OpenOpt, FuncDesigner, SpaceFuncs, DerApproximator).
Main changes:
* Python 3 compatibility
* Lots of improvements and speedup for interval calculations
* Now interalg can obtain all solutions of nonlinear equation (example) or systems of them (example) in the involved box lb_i <= x_i <= ub_i (bounds can be very large), possibly constrained (e.g. sin(x) + cos(y+x) > 0.5 or [sin(i*x) + y/i < i for i in range(100)] )
* Many other improvements and speedup for interalg
Regards, D.
Monday, June 13, 2011
Analytics geeks win NBA championships
The analytics culture starts with Dallas Mavericks owner Mark Cuban. According to ESPN when Mark Cuban was looking for a coach he studied games and found out that Rick Carlisle used the most efficient lineups most frequently. Mark Cuban hiring Rick Carlisle to coach the Mavericks was a no-brainer because the numbers do not lie. As for Rick Carlisle, he is known for being a very cerebral coach and very handy with crunching NBA statistics as well.
Another known fact about the Dallas Mavericks is that they use an analytics staff to gain a competitive edge. Most recently they have retained the NBA analytics stat guru Roland Beech of 82games.com. In the past they had used the services of Wayne Winston, an Operations Research professor, to help analyze their lineups to be more competitive.
Mark Cuban gives a lot of credit to the geeks for the Mavericks winning. From the ESPN article:
I give a lot of credit to Coach Carlisle for putting Roland on the bench and interfacing with him, and making sure we understood exactly what was going on. Knowing what lineups work, what the issues were in terms of play calls and training.
That is a lot of brainpower on the bench in every game. It is good to see the geeks get their due. Way to go Mavericks and looking forward to seeing what the geeks put on the court next season!
Monday, May 23, 2011
Sports analytics summer blog reading recommendations
Baseball
FanGraphs.com
FanGraphs is the all-everything baseball numbers website. The best thing FanGraphs is known for is having a complete database of baseball player metrics. One of my favorite metrics in baseball is WAR, or Wins Above Replacement. If that is not enough, they even have heat maps of strike zone pitching locations. Tracking your favorite team has never been more analytically exciting.
Football
AdvancedNFLstats.com
Advanced NFL Stats is the best NFL analytics blog out there right now. Similar to FanGraphs, there is a complete database of NFL offense and defense metrics. Advanced NFL Stats also does a good job of explaining the numbers behind the measurements. Football is not an easy sport in which to analyze team and player performance, and this site does an excellent job of both. Advanced NFL Stats also keeps a database of play-by-play data.
Drive-by-Football
The up-and-comer of the NFL analytics blogs is Drive-By Football. Drive-By does a great job of explaining some of the harder math around determining team and player efficiency. One of the most interesting features is the Markov Chain Drive calculator, which calculates the likelihood of scoring scenarios drive by drive, hence the name of the website.
Basketball
Wayne Winston blog
This blog's primary focus is on basketball, specifically the NBA. Wayne Winston is definitely known as a prolific Operations Research professor. What you may not know is that Wayne Winston consulted for the Dallas Mavericks and other sports teams to help improve their franchises. Wayne talks about other sports from time to time as well. If you have not read Wayne Winston's book Mathletics: How Gamblers, Managers, and Sports Enthusiasts Use Mathematics in Baseball, Basketball, and Football, you are in for an analytical treat. Wayne analyzes the why and how of measuring professional sports efficiency and winning.
Tuesday, May 17, 2011
In Memoriam: Dr. Paul Jensen
I unfortunately did not know Dr. Jensen personally. I was first introduced to his ORMM website through my graduate courses at SMU. The ORMM website is a great resource for learning the principles of Operations Research methods. I was also able to use some of his Excel modeling add-ins in practice to demonstrate optimization problems.
Dr. James Cochran is going to hold a special session in memory of Dr. Jensen. The following message from Dr. Cochran was sent on Dr. Jensen's ORMM mailing list.
Dear friends and colleagues,
Paul was a good friend and colleague. I know each of us will miss him (as will many other friends throughout the OR community) and each of us is very sorry for the loss suffered by Margaret and the rest of Paul's family.
I will chair a special INFORM-ED sponsored session in Paul's memory at the 2011 INFORMS Conference in Charlotte (November 13-16). Several of Paul's many friends will speak on his contributions to operations research education and share personal stories and remembrances about Paul. Margaret and Paul's children will be invited to attend, and I hope each of you will also be able to attend (I'll try to reserve some time at the end of the session during which members of the audience will have an opportunity to share their thoughts).
INFORMS Transactions on Education (the online journal for which I am Editor in Chief) will also publish a special issue devoted to Paul's influence on OR education. Dave Morton has kindly agreed to edit this special issue, so I am certain it will be a fine tribute to Paul.
Sincerely,
Jim
Monday, May 16, 2011
Welcome to the Insight Age
Does this sound familiar to anyone in Operations Research? It should because this is what Operations Research has been doing for years. I think I sound like a broken record sometimes. Yet I guess the story needs to be told again. But perhaps I'm being a little too snarky. It could just mean that the Information Age is catching up to the decision science analysts.
The crux of the article is technology meeting the demands of information overload. Yet that is not what the definition of insight is to me. Insight is drawing conclusions based on the evidence. The Operations Research analyst will undoubtedly be well prepared for this evolutionary advancement. I'm sure HP is aware that technology alone will not help the Insight Age revolution.
I hope we've all seen this new age coming. The Insight Age is here and is ready to be tackled. My next inclination is to think what will define the Insight Age. The Information Age was defined by the internet, computing power, and globalization. My prognostication to define the Insight Age is open data and decision science. Open data is about having no barriers to information. Data will be freely accessible and easy to disseminate. Decision science is already here and will make an even bigger impact. Machine Learning, Artificial Intelligence, Optimization Algorithms will all be the cogs of the Insight Age mechanism.
Insight Age is such a fitting name. I'm really liking it the more I think about it. I'm going to try to remember that in some of my future conversations.
Sunday, May 15, 2011
R Tutorial: Add confidence intervals to dotchart
R is my tool of choice for data visualization. My audience was a general one, so I didn't want to use boxplots or other density-type visualization methods. I wanted a simple mean with a roughly 95% band (about 2 standard deviations) around it. My method of choice was the dotchart function. Yet that function is limited to showing the data points and not the dispersion of the data. So I needed to layer in the confidence intervals.
The great thing about R is that the functions and objects are pretty much layered. I can create one R object and add to it as I see fit. This is mainly true with most plotting functions in R. I knew that I could use the lines function to add lines to an existing plot. This method worked great for my simplistic plot and adds another tool to my R toolbox.
Here is the example dotchart with confidence intervals R script using the "mtcars" dataset that is provided with any R installation.
x <- data.frame(mean=tapply(mtcars$mpg, list(mtcars$cyl), mean),
                sd=tapply(mtcars$mpg, list(mtcars$cyl), sd))
### Add lower and upper levels of confidence intervals
x$LL <- x$mean-2*x$sd
x$UL <- x$mean+2*x$sd
### plot dotchart with confidence intervals
title <- "MPG by Num. of Cylinders with 95% Confidence Intervals"
dotchart(x$mean, col="blue",
         xlim=c(floor(min(x$LL)/10)*10, ceiling(max(x$UL)/10)*10),
         main=title)
for (i in 1:nrow(x)){
  lines(x=c(x$LL[i],x$UL[i]), y=c(i,i))
}
grid()
And here is the example of the finished product.
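One caveat worth noting: mean plus or minus 2 standard deviations describes the spread of the individual cars. If instead you want an interval for the group mean itself, a small variation of the script above (my own tweak, not part of the original) divides each standard deviation by the square root of the group size:
n <- tapply(mtcars$mpg, list(mtcars$cyl), length)   # group sizes
x$SE <- x$sd / sqrt(n)                              # standard error of each group mean
x$LL95 <- x$mean - 2*x$SE
x$UL95 <- x$mean + 2*x$SE
dotchart(x$mean, col="blue", xlim=range(c(x$LL95, x$UL95)), main=title)
for (i in 1:nrow(x)){
  lines(x=c(x$LL95[i], x$UL95[i]), y=c(i,i))
}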
Tuesday, May 3, 2011
Google funding research to measure regret
measure the distance between a desired outcome and the actual outcome, which can be interpreted as “virtual regret.”
That sounds a lot like mathematical programming to me. So what is so different about the Tel Aviv University findings? Apparently it's not something new with the algorithms but more with how the data is processed. Dr. Yishay Mansour explains that they will be using machine learning methodologies to look at all the relevant variables in advance of making informed decisions. It sounds more like this research is in the realm of understanding large amounts of data and processing them into usable information.
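To make the "virtual regret" idea concrete, here is a toy R sketch of my own (not the Tel Aviv group's algorithm): compare the rewards of the action actually taken with the best single action in hindsight, and accumulate the difference.
set.seed(7)
rewards <- cbind(a1=rnorm(100, mean=0.4), a2=rnorm(100, mean=0.6))   # toy payoff streams for two actions
chosen <- rewards[, "a1"]                          # the action our policy actually took each round
best <- rewards[, which.max(colMeans(rewards))]    # best fixed action in hindsight
regret <- cumsum(best - chosen)                    # cumulative "virtual regret"
tail(regret, 1)                                    # total regret after 100 decisions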
Big data is a huge problem in the data-rich but information-poor internet environment that we face today. There is a lot of data handled by organizations, but they need to know what to do with it. Today's Operations Research professional should be perched to swoop in and help with this issue. Organizations are data rich but lack the focus to apply it to meaningful decision analysis. I'm hoping that this is going to lead to a big watershed moment for the Operations Research community.
Thursday, April 21, 2011
Open Source replacements for Operations Research and Analytics Software
Statistics and Computation
1. R Project
Replaces: SAS, SPSS
R is a free and open source statistical computing environment that holds its own against some of the most established proprietary statistical environments. R is available on all operating systems and is free for download. R also has a community driven library of add-on packages that are also freely available and cover almost any statistical, mathematical, or optimization need.
Also a great reference manual for those switching from SAS to R is SAS and R: Data Management, Statistical Analysis, and Graphics
2. RapidMiner
Replaces: KnowledgeSEEKER
RapidMiner is a data mining software with a graphical front-end. RapidMiner is suitable for most data mining and data transformation needs.
Mathematical Programming and Optimization
3. GLPK
Replaces: AMPL
GLPK (the GNU Linear Programming Kit) is free software intended for large-scale linear programming and mixed integer programming. GLPK includes the GNU MathProg modeling language (also known as GMPL), which is considered a subset of the AMPL syntax, and GLPK also has its own solver. (A small example of calling GLPK from R appears after this list.)
4. Symphony
Replaces: CPLEX, Gurobi
Symphony is a mixed integer linear programming solver developed under COIN-OR. Symphony is a flexible framework that offers many methods to customize solver capabilities given problem sets.
5. OpenSolver
Replaces: Excel Solver
OpenSolver is a linear and integer optimization alternative to the built-in Solver in Microsoft Excel. OpenSolver is based on the COIN-OR CBC engine. Unlike the Excel Solver, there are no software limits on the size of the problem that can be solved.
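Since this blog leans on R, here is a small sketch of calling GLPK from R through the Rglpk package (one of several interfaces; I'm assuming the package is installed). It maximizes 3x + 2y subject to two linear constraints with nonnegative variables.
library(Rglpk)                       # R interface to GLPK
obj <- c(3, 2)                       # maximize 3x + 2y
mat <- matrix(c(1, 1,
                2, 1), nrow=2, byrow=TRUE)
dir <- c("<=", "<=")                 # x + y <= 4 and 2x + y <= 6; x, y >= 0 by default
rhs <- c(4, 6)
Rglpk_solve_LP(obj, mat, dir, rhs, max=TRUE)   # optimum 10 at x = 2, y = 2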
Tuesday, April 12, 2011
OReilly - Quiet Rise of Machine Learning
Yet to me what is interesting in this article is that it implies that machine learning science is rising from basically nothing. As if this is some sort of newfangled technology developed by IBM for a special man vs. machine Jeopardy act. I guess I'm close enough to the Operations Research community to know where the roots really lie. For one I'm happy that decision sciences like machine learning are getting more and more recognition. On the other hand I'm thinking "Where have you been since World War II?". I guess I'm a little too cynical lately.
I love the OReilly Radar blog as it seems more and more articles are about the promise of data analytics. I guess I'm just wishing for a little more investigative reporting. In fact I think it would benefit INFORMS if they partnered with OReilly Media. OReilly definitely has a focus on analytics now and INFORMS is prime to provide a lot of great content for discussion.
Wednesday, March 30, 2011
Baseball and Decision Analytics
Baseball is definitely a numbers game. Mathematicians have been studying baseball for as long as the game itself has been played. One of the first notable baseball analysts to apply decision analysis was Bill James. Bill coined the term sabermetrics for the study of baseball analysis, derived from the acronym of the Society for American Baseball Research (SABR). More recently baseball decision analysis has found its way into Major League Baseball teams' management offices. Popular books such as Moneyball by Michael Lewis and The Extra 2%: How Wall Street Strategies Took a Major League Baseball Team from Worst to First by Jonah Keri have shown how major league management turned poor-performing clubs into championship contenders. The mathematics behind their decision analysis is best described in Wayne Winston's book Mathletics: How Gamblers, Managers, and Sports Enthusiasts Use Mathematics in Baseball, Basketball, and Football.
Baseball decision analysis has grown up since the batting-average days of Bill James. Now baseball decision analysis uses techniques such as replacement value. Value Over Replacement determines the value of a player relative to an average, run-of-the-mill player at the same position. Value Over Replacement was made popular by Keith Woolner of Baseball Prospectus. At first the value, which is usually offensive value, was used to determine how many runs a player could produce over an average player. Now value-over-replacement methodologies determine how many wins a player can generate for their team. One of the best sites for WAR analysis, or Wins Above Replacement, is Fangraphs. Fangraphs has about every major baseball statistic available for the enthusiast. In fact they even have heat maps for pitch location. Ready to manage your own team yet?
Pitch location heat map from Fangraphs.com
Of course all of this decision analysis would not be possible without the numbers. One of the best places for baseball data is Baseball-Reference.com. Just about every data point on baseball can be mined from the site and downloaded. So if you have a craving to create your own baseball metric or analytics strategy there should be nothing stopping you.
This is another post in the INFORMS Online Blog Challenge. This month is O.R. and Sports.
Tuesday, March 22, 2011
R again in Google Summer of Code
The Google Summer of Code has really grown over the years. I'm glad to see that these open source initiatives really help teach our younger generation.
Wednesday, March 16, 2011
OpenOpt Suite release 0.33
New release 0.33 of OpenOpt Suite is out:
OpenOpt:
- cplex has been connected
- New global solver interalg with guaranteed precision, competitor to LGO, BARON, MATLAB's intsolver and Direct (it can also work in inexact mode); can work with non-Lipschitz and even some discontinuous functions
- New solver amsg2p for unconstrained medium-scaled NLP and NSP
FuncDesigner:
- Essential speedup for automatic differentiation when vector-variables are involved, for both dense and sparse cases
- Solving MINLP became available
- Add uncertainty analysis
- Add interval analysis
- Now you can solve systems of equations with automatic determination of whether the system is linear or nonlinear (subject to the given set of free or fixed variables)
- FD Funcs min and max can work on lists of oofuns
- Bugfix for sparse SLE (system of linear equations), that slowed down computations and demanded more memory
- New oofuns angle, cross
- Using OpenOpt result(oovars) is available; also, start points with oovars() can now be assigned more easily
SpaceFuncs (2D, 3D, N-dimensional geometric package with abilities for parametrized calculations, solving systems of geometric equations and numerical optimization with automatic differentiation):
- Some bugfixes
DerApproximator:
- Adjusted with some changes in FuncDesigner
For more details visit http://openopt.org.
Saturday, February 19, 2011
IBM has a Natural Language Purpose
Instead I came across an interesting editorial on the cheap publicity stunt that is IBM's Watson. At first I thought the article was a comedy that would make fun of Watson's errors on Jeopardy. Then I realized the author Colby Cosh is not jesting at all. This should not be news to me. The field of Operations Research, which was definitely used to help develop Watson, is a widely misunderstood field. Cosh has a hard time understanding why IBM would want to develop such a stunt to compete against humans. Cosh seems to think that the only ones who gain are IBM's shareholders. I can assure you that if IBM wanted to make money on this venture they would have created a computer that would compete on American Idol. Jeopardy is no ratings juggernaut in the US.
So what purpose would IBM have for competing on Jeopardy? Perhaps the idea of "competition" is misleading. In my eyes I was not watching to see if a computer can beat humans in a battle of wits. I was watching to see if a device could interpret, process, and return meaningful information on the same level as human interpretation. Natural Language Processing is like code breaking. Similarly, mathematics, physics, and natural science are like codes to mathematicians, scientists, and engineers. It is the process of trying to decipher and interpret our natural surroundings. Language is no different. I can see how it would be easy for Cosh to think that the sole idea of the competition is to beat humans. The purpose was simply to decipher the natural language code. By understanding natural language better we can then understand our surroundings a little better.
So why the hype with a computer?
"So why, one might ask, are we still throwing computer power at such tightly delimited tasks,..."
The answer can be found already in the field of Operations Research and Management Science. Perhaps Cosh has purchased a plane ticket in the past few years. He might have noticed that air transportation has become very affordable due to competitive pricing. A lot of that is due to optimization and revenue management algorithms in the airline industry. Perhaps he noticed the increase in quality, service, and pricing of privatized parcel postage. The science of better decision making and transportation algorithms have greatly improved supply chain and delivery efficiency. The list can go on and on. Artificial Intelligence is probably a poor way of describing computer optimization and machine learning science. Artificial Intelligence is not going to replace human intelligence but only help improve the human-based decisions that we make every day. IBM has already stated that they wish to improve the medical field with Watson. Medical diagnosis requires vast amounts of information, and Watson can help decipher medical journals, texts, and resources within seconds. Applications of Watson could be used in third world countries where medical resources are scarce.
I will be looking forward to IBM's advancement with Natural Language Processing. This offers a new venture into better decision sciences. Perhaps "smacking into the limits" of artificial intelligence will create a better life for those that use human intelligence every day.
Wednesday, February 16, 2011
Question and Answer sites for Analytics and Operations Research
Operations Research
OR-Exchange
This site is a Stack Exchange-style Q&A site started by Michael Trick. A crowd-sourced question and answer resource for anything Operations Research related. I believe it is the best one available for Operations Research.
Numerical Optimization Forum
I find it unfortunate that there are so few forums devoted to Operations Research. This is one of the few and it is a good one. It is moderated by IEOR Tools contributor Dmitrey.
I'm purposely leaving out the sci.ops-research Usenet group because I believe it has fallen into disarray with spam content.
Math/Statistics
Cross Validated
My favorite stack-exchange site dedicated to statistics.
Math
Math Overflow
Software
StackOverflow - R tag
StackOverlow - SQL tag
Mailing Lists
Mailing lists do not get as much notoriety as they once did. Maybe because there are so many other options on the internet for getting information. I still think they are a valuable resource and a good online community.
R Help Mailing List http://www.r-project.org/mail.html
GLPK Mailing List http://lists.gnu.org/mailman/listinfo/help-glpk
COIN-OR Mailing List(s) http://list.coin-or.org/mailman/listinfo/
Beta StackExchange Sites
These sites might be of interest to the Operations Research community. They are not live yet but are looking to generate a following.
http://area51.stackexchange.com/proposals/28815/computational-science
http://area51.stackexchange.com/proposals/1907/numerical-modeling-and-simulation
http://area51.stackexchange.com/proposals/27706/engineering-and-applied-sciences
http://area51.stackexchange.com/proposals/26434/machine-learning
http://area51.stackexchange.com/proposals/24602/data-capture-analysis
http://area51.stackexchange.com/proposals/22964/sas-programming-language
http://area51.stackexchange.com/proposals/18584/engineering-and-scientific-software-tools
http://area51.stackexchange.com/proposals/15237/r-statistical-package
http://area51.stackexchange.com/proposals/9218/operations-research (Interesting. Do they know about OR-Exchange?)
I would love to see more examples that I can include in this list.
UPDATE:
I forgot to add O'Reilly's Q&A site with the R tag. http://answers.oreilly.com/tag/R
Friday, February 11, 2011
Science of Matchmaking
Mathematics of Matchmaking
I'm not sure I can cover all of the math behind the science of matchmaking. I thought it best to describe an example with the company Netflix. Netflix wants to make the process of selecting movies easier for its customers. Netflix developed an algorithm to match customers' interests in movies. In fact they even decided to farm out an improvement to the algorithm in a worldwide contest. So how does the Netflix algorithm work? There is a lot of math behind the algorithm but it essentially comes down to finding common features in the customer and movie data. The customers give Netflix a clue to the features they want by rating the movies they enjoy. Those ratings then become the dependent variables in the algorithm formulation. Then the algorithms churn out likely matches based on common feature sets.
Perhaps one of the best writings on this subject was given by Simon Funk on his blog about his Netflix Contest adventures. Simon thought a creative way to find features would be to use the matrix factorization technique of Singular Value Decomposition. Traditionally Singular Value Decomposition was used in the microelectronics industry to improve digital signal processing. Simon wrote up an easy solution for matchmaking movie features with the SVD method, which spurred a wave of enthusiasm among Netflix Contest entrants.
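Here is a toy sketch of the idea in R (mine, not Simon Funk's actual incremental algorithm, which avoids forming the full matrix): take a small ratings matrix with missing entries, crudely impute them, and use a rank-2 SVD to build a low-rank approximation whose entries serve as predicted ratings.
ratings <- matrix(c(5, 4, NA, 1,
                    4, 5, 1, NA,
                    NA, 1, 5, 4,
                    1, NA, 4, 5), nrow=4, byrow=TRUE)   # rows = customers, columns = movies
filled <- ratings
filled[is.na(filled)] <- mean(ratings, na.rm=TRUE)      # crude imputation, just for the demo
s <- svd(filled)
k <- 2                                                  # keep two latent "features"
approx <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
round(approx, 2)                                        # cells that were NA now hold predicted ratings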
Finding feature sets is not exclusively the realm of Linear Algebra. There are also methods such as clustering, regression, support vector machines, neural networks, Bayesian networks, and decision trees, just to name a few. The science of matchmaking is closely related to artificial intelligence and is commonly referred to as machine learning. Machine learning uses algorithms and mathematical methods to evolve and generate behaviors from data in order to solve a problem.
Processing the Matchmaking Data
The science of matchmaking would not be complete without the data. The advent of the internet has opened a lot of new enterprises that make use of millions of data observations. These internet companies have a lot of data to process, in huge server arrays that would make even the ENIAC envious. So how do these companies process all of this matchmaking data with their matchmaking algorithms? The basic answer is to break it down into manageable chunks. Perhaps there is no greater example than Google and their MapReduce framework. MapReduce is a software framework that takes a large computing job and breaks it down across a distributed network so that it becomes more manageable. The first step in the MapReduce process is to Map the data: organize and distribute it to computing nodes, usually a huge cluster. The Reduce step then applies the algorithm or learning process at each node and determines an answer for the data it is given, essentially a local result. The partial results are combined, and the process is iterated until a globally learned answer is achieved. This is a very cut-and-dried description but you get the idea.
The MapReduce software framework is proprietary to Google. That has not stopped software enthusiasts: an open source MapReduce implementation called Hadoop was created and is growing into a strong user-supported community.
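As a conceptual sketch in plain base R (not Google's or Hadoop's actual API), here is the map-then-reduce pattern applied to computing a global mean: split the data into chunks, compute a partial result per chunk, then combine the partials.
set.seed(42)
obs <- runif(1e6)                                                     # pretend this lives across many machines
chunks <- split(obs, cut(seq_along(obs), 10, labels=FALSE))           # "map": hand 10 chunks to 10 nodes
partials <- lapply(chunks, function(ch) c(sum=sum(ch), n=length(ch))) # each node computes a partial result
totals <- Reduce(`+`, partials)                                       # "reduce": combine the partials
totals["sum"] / totals["n"]                                           # global mean, about 0.5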
So what can the science of matchmaking be used for? Really anything the heart desires (okay, again, that was bad). Amazon.com uses recommendation algorithms for its books and products. Online dating sites (how appropriate) use matchmaking methods to pair interested daters. Search engines like Google use matching algorithms, such as PageRank, to match search keywords with websites. As you can tell, these types of enterprises are doing very well thanks to the science of matchmaking.
This article is part of the INFORMS Online blog challenge. February's blog challenge is Operations Research and Love.
Wednesday, February 9, 2011
Data Mining Books List
I believe this is probably one of the most comprehensive lists of Data Mining books available. If you are interested in obtaining one of these books please be sure to peruse the new IEOR Tools Online Store Data Mining section. There you can find books and references on Data Mining, ranging from introductory texts to advanced applications.
Thursday, January 27, 2011
Operations Research and gerrymandering
One area of politics is the topic of gerrymandering. For those that are not familiar with gerrymandering, it is the process of redrawing electoral boundaries for voting purposes. Gerrymandering is a hot topic around any election because it is usually the party in power that controls the right to redraw the electoral boundaries. This gives an obvious advantage to that party, as it can maintain seats in a legislature by setting boundaries based on past voting behavior.
Operations Research can be a valuable asset to the process of redistricting. In fact Operations Research has been very much involved in redistricting for at least 50 years. Decisions to draw electoral lines can follow any number of constructions including demographics, population centers, municipality boundaries or industry types. As information abounds more freely there is more opportunity to use it for decision making. It seems every new census brings more available data. The growth of the internet has allowed information to be available more openly. Opportunities should grow in Operations Research to provide redistricting decision makers the information for informed analysis.
Perhaps one of the better uses of Operations Research could be in the ethical context of the gerrymandering debate. I have often heard it argued that Operations Research may have helped create the politically polarized country we have today. The same tools of Operations Research could be used to bring transparency to the redistricting process. It could be useful for citizens to know the probable or likely outcomes of elections under proposed redistricting plans. Websites like OpenSecrets.org show how money influences party affiliation and elections. Perhaps similar websites can emerge on electoral districts and the legislation that helped create them.
The debate over gerrymandering will last for centuries, I am sure. I believe Operations Research can play a vital part in that debate. Information is more open and easier to access than ever. Let's use that to our best ability and help inform the voting electorate.