Maximize Productivity with Industrial Engineer and Operations Research Tools: Visualizing categorical data in R

Wednesday, December 21, 2011

Visualizing categorical data in R

I came across an interesting SAS macro that was used for visualizing log odds relationships of data. This type of chart is helpful for visualizing the relationship between a binary dependent variable and a continuous independent variable. I don't use SAS on a daily basis as I prefer to use R. So I got to thinking that I could recreate this macro using only R. I thought this would be a good tutorial for R on developing functions, using different plot techniques, and overlapping chart types.

The following picture is the result of the logodds function in R. The chart is really close but not quite exact. For the histogram points I decided to use the default squares of the stripchart plot and used a grey color to make it look a little faded.

The following is the R script.

logoddsFnc <- function(data_ind, data_dep, ind_varname, min.count=1){

# Assumptions: x & y are numeric vectors of the same
# length, y is 0/1 varible. This returns a vector
# of breaks of the x variable where each bin has at
# least min.countnumber of y's
bin.by.other.count <- function(x, other, min.cnt=1) {
    csum <- cumsum(tapply(other, x, sum))
    breaks <- numeric(0)

    i <- 1
    breaks[i] <- as.numeric(names(csum)[1])
    cursum <- csum[1]

    for ( a in names(csum) ) {
      if ( csum[a] - cursum >= min.cnt ) {
        i <- i + 1
        breaks[i] <- as.numeric(a)
        cursum <- csum[a]
      }
    }

    breaks
}

brks <- bin.by.other.count(data_ind, data_dep, min.cnt=min.count)

# Visualizing binary categorical data
var_cut <- cut(data_ind, breaks=brks, include.lowest=T)
var_mean <- tapply(data_dep, var_cut, mean)
var_median <- tapply(data_ind, var_cut, median)

mydf <- data.frame(ind=data_ind, dep=data_dep)
fit <- glm(dep ~ ind, data=mydf, family=binomial())
pred <- predict(fit, data.frame(ind=min(data_ind):max(data_ind)),
        type="response", se.fit=T)

# Plot
plot(x=var_median, y=var_mean, ylim=c(0,1.15),
       xlab=ind_varname, ylab="Exp Prob", pch=21, bg="black")
stripchart(data_ind[data_dep==0], method="stack",
             at=0, add=T, col="grey")
stripchart(data_ind[data_dep==1], method="stack",
             at=1, add=T, col="grey")

lines(x=min(data_ind):max(data_ind),
        y=pred$fit, col="blue", lwd=2)
lines(lowess(x=var_median,
               y=var_mean, f=.30), col="red")

lines(x=min(data_ind):max(data_ind),
        y=pred$fit - 1.96*pred$se.fit, lty=2, col="blue")
lines(x=min(data_ind):max(data_ind),
        y=pred$fit + 1.96*pred$se.fit, lty=2, col="blue")
}

logoddsFnc(icu$age, icu$died, "age", min.count=3)

The ICU data for this example can be found in the R package "vcdExtra". Special thanks to David of Univ. of Dallas for providing me with a way to develop breaks in the independent variable as seen by the bin.by.other.count function.

The author of the SAS macro is also the author of Visualizing Categorical Data by M. Friendly which is a great reference for analyzing and visualizing data in factored groups.

5 comments:

Marti said...: hi, thanks so much! It's interesting! One question: If I want to try it I need a package if want to obtain the data "e.g: data_ind). Isnt't it?

thanks in advance.

Martí Casals.; December 22, 2011 at 2:44 AM
larrydag said...: Yes you can get the data from the R package "vcdExtra". The data is named ICU. Also you can get the data from the Michael Friendly's VCD website but its not as clean and will require manipulation.
http://www.datavis.ca/sas/vcd/catdata/icu.sas; December 22, 2011 at 8:54 PM
Abdiel Technologies said...: http://www.Hound.com is a good source of jobs because it only shows you jobs from employer websites and every other job board out there. www.Hound.com is a job website that show its member jobs for every employment website through out the world. In this website you can search jobs on the basis of specific locations and practice areas. The site has the highest number of jobs in the world.
www.Hound.com is a good source of jobs because it only shows you jobs from employer websites and every other job board out there.
operations analyst jobs; January 9, 2012 at 10:16 PM
Jyoti said...: It was good reading this great.
failed to find some thing about static mixers.; January 24, 2012 at 1:03 AM
Assignment Help said...: Great Blog with useful information.
Experience the Best Assignment Writing Services at
Science Assignment Help; December 13, 2019 at 12:38 AM