Friday, December 30, 2011
Most popular 2011 IEOR Tools blog articles
Here are the most popular IEOR Tools blog articles of 2011. I like to reflect on this list every year: it gives me perspective on what is being read, and it is an interesting look at our interests. This year seems to be about our thirst for software tools and how to use them. Books are also still big as reference material.
- Open Source Replacements for Operations Research and Analytics Software
- R Tutorial: Add confidence intervals to a dot chart
- Science of Matchmaking
- Data Mining Books list
- Physicist cuts airplane boarding time in half
- R again in Google Summer of Code
- Moneyball coming to the big screen
- Baseball and Decision Analytics
- Question and Answer sites for Analytics and Operations Research
- Sports analytics summer blog reading recommendations
Wednesday, December 21, 2011
Visualizing categorical data in R
I came across an interesting SAS macro for visualizing log odds relationships in data. This type of chart helps show the relationship between a binary dependent variable and a continuous independent variable. I don't use SAS on a daily basis, as I prefer R, so I got to thinking that I could recreate this macro using only R. I thought this would make a good R tutorial on developing functions, using different plot techniques, and overlaying chart types.
The following picture is the result of the logodds function in R. The chart is really close but not quite exact. For the histogram points I decided to use the default squares of the stripchart plot and used a grey color to make it look a little faded.
The following is the R script.
logoddsFnc <- function(data_ind, data_dep, ind_varname, min.count=1){

  # Assumptions: x & other are numeric vectors of the same
  # length, other is a 0/1 variable. This returns a vector
  # of breaks of the x variable where each bin has at
  # least min.cnt 1's in other.
  # Note: x values above the last break fall outside every bin.
  bin.by.other.count <- function(x, other, min.cnt=1) {
    # Cumulative count of 1's at each distinct x value
    csum <- cumsum(tapply(other, x, sum))
    breaks <- numeric(0)
    i <- 1
    breaks[i] <- as.numeric(names(csum)[1])
    cursum <- csum[1]
    for ( a in names(csum) ) {
      # Start a new bin once min.cnt more 1's have accumulated
      if ( csum[a] - cursum >= min.cnt ) {
        i <- i + 1
        breaks[i] <- as.numeric(a)
        cursum <- csum[a]
      }
    }
    breaks
  }

  brks <- bin.by.other.count(data_ind, data_dep, min.cnt=min.count)

  # Visualizing binary categorical data
  var_cut <- cut(data_ind, breaks=brks, include.lowest=TRUE)
  var_mean <- tapply(data_dep, var_cut, mean)      # observed proportion per bin
  var_median <- tapply(data_ind, var_cut, median)  # x position of each bin

  # Logistic regression fit, predicted over an integer-spaced grid
  mydf <- data.frame(ind=data_ind, dep=data_dep)
  fit <- glm(dep ~ ind, data=mydf, family=binomial())
  xgrid <- min(data_ind):max(data_ind)
  pred <- predict(fit, data.frame(ind=xgrid),
                  type="response", se.fit=TRUE)

  # Plot
  plot(x=var_median, y=var_mean, ylim=c(0,1.15),
       xlab=ind_varname, ylab="Exp Prob", pch=21, bg="black")
  # Stacked points of the raw data at y = 0 and y = 1
  stripchart(data_ind[data_dep==0], method="stack",
             at=0, add=TRUE, col="grey")
  stripchart(data_ind[data_dep==1], method="stack",
             at=1, add=TRUE, col="grey")
  # Fitted curve (blue), lowess smooth of the bin means (red)
  lines(x=xgrid, y=pred$fit, col="blue", lwd=2)
  lines(lowess(x=var_median,
               y=var_mean, f=.30), col="red")
  # Approximate 95% bands around the fitted curve
  lines(x=xgrid, y=pred$fit - 1.96*pred$se.fit, lty=2, col="blue")
  lines(x=xgrid, y=pred$fit + 1.96*pred$se.fit, lty=2, col="blue")
}

logoddsFnc(icu$age, icu$died, "age", min.count=3)
The ICU data for this example can be found in the R package "vcdExtra". Special thanks to David of Univ. of Dallas for providing me with a way to develop breaks in the independent variable as seen by the bin.by.other.count function.
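One detail of the script is worth a note: the dashed bands are computed as pred$fit plus or minus 1.96*pred$se.fit on the probability scale, so near the edges they can stray below 0 or above 1. The following is a minimal sketch of one common alternative, not part of the original script: build the band on the logit scale and transform it back with plogis. The simulated data and the names x, y, grid and band are assumptions made up for illustration.

```r
# Sketch only: simulated data stands in for the ICU example.
set.seed(1)
x <- sample(20:90, 200, replace = TRUE)        # hypothetical ages
y <- rbinom(200, 1, plogis(-4 + 0.05 * x))     # hypothetical 0/1 outcome

fit <- glm(y ~ x, family = binomial())
grid <- data.frame(x = min(x):max(x))

# Predict on the link (log odds) scale, where the normal approximation is used
pl <- predict(fit, grid, type = "link", se.fit = TRUE)

# Transform the fit and its limits back to probabilities
band <- data.frame(
  x     = grid$x,
  fit   = plogis(pl$fit),
  lower = plogis(pl$fit - 1.96 * pl$se.fit),
  upper = plogis(pl$fit + 1.96 * pl$se.fit)
)
head(band)
```

The lower and upper columns could then be passed to lines() in place of the two dashed-line calls; because plogis maps any real number into (0, 1), the band always stays inside the plot's probability range.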
The author of the SAS macro, M. Friendly, also wrote Visualizing Categorical Data, which is a great reference for analyzing and visualizing data in factored groups.
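For readers who want to see the log odds themselves rather than the probabilities, here is a rough sketch in base R. It is an assumption-laden illustration, not the SAS macro's method: the data are simulated, the bins are simple deciles rather than the output of bin.by.other.count, and the 0.5 added to each count is a common continuity correction that keeps the log odds finite when a bin has no events.

```r
# Sketch only: simulated data; all names here are hypothetical.
set.seed(42)
n <- 500
age <- sample(20:90, n, replace = TRUE)
died <- rbinom(n, 1, plogis(-4 + 0.05 * age))

# Decile bins of the independent variable
brks <- unique(quantile(age, probs = seq(0, 1, by = 0.1)))
age_bin <- cut(age, breaks = brks, include.lowest = TRUE)

bin_n      <- tapply(died, age_bin, length)   # observations per bin
bin_events <- tapply(died, age_bin, sum)      # 1's per bin
bin_mid    <- tapply(age, age_bin, median)    # x position of each bin

# Empirical log odds per bin, with a 0.5 continuity correction
emp_logit <- log((bin_events + 0.5) / (bin_n - bin_events + 0.5))

# Fitted log odds from a logistic regression at the same x positions
fit <- glm(died ~ age, family = binomial())
fit_logit <- predict(fit, data.frame(age = as.numeric(bin_mid)), type = "link")

result <- data.frame(age = as.numeric(bin_mid),
                     n = as.integer(bin_n),
                     emp_logit = as.numeric(emp_logit),
                     fit_logit = as.numeric(fit_logit))
print(result)
```

If the relationship is roughly linear on the log odds scale, the emp_logit column should track fit_logit; large, systematic gaps suggest the continuous variable needs a transformation.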
Thursday, December 15, 2011
OpenOpt Suite 0.37
Hi all,
I'm glad to inform you about the new release 0.37 (2011-Dec-15) of our free software:
OpenOpt (numerical optimization):
- IPOPT initialization time gap (time till first iteration) for FuncDesigner models has been decreased
- Some improvements and bugfixes for interalg, especially for "search all SNLE solutions" mode (Systems of Non Linear Equations)
- Eigenvalue problems (EIG) (in both OpenOpt and FuncDesigner)
- Equality constraints for GLP (global) solver de
- Some changes for goldenSection ftol stop criterion
FuncDesigner:
- Major sparse automatic differentiation improvements for badly-vectorized or unvectorized problems with lots of constraints (except box bounds); some problems now run many times or even orders of magnitude faster (of course not faster than vectorized problems with insufficient number of variable arrays). It is recommended to retest your large-scale problems with useSparse = 'auto' | True | False
- Two new methods for splines to check their quality: plot and residual
- Solving ODE dy/dt = f(t) with specifiable accuracy by interalg
- Speedup for solving 1-dimensional IP by interalg
SpaceFuncs and DerApproximator:
- Some code cleanup
You may trace OpenOpt development information in our recently created entries in Twitter and Facebook, see http://openopt.org for details.
See also: FuturePlans, this release announcement in OpenOpt forum
Regards, Dmitrey.
Expanded list of online courses for data analysis
The folks at Stanford have been really busy putting together online curriculum for the world to learn. This is a follow-up to a previous post, Machine Learning for everyone. Stanford has added a bunch of other courses that they will offer online.
Some of the interesting courses are
- Model Thinking
- Natural Language Processing
- Game Theory
- Design and Analysis of Algorithms
- Introduction to Linear Dynamical Systems
- Convex Optimization I
- Convex Optimization II
As an analytics professional for many years, I've found that honing your skills is very important for your career. It is now easier than ever, with schools opening up their classes to everyone. I strongly recommend picking areas of expertise that you are passionate about or want to learn, and finding the schools that offer them online.