Tuesday, December 7, 2010

Big Data Logistic Regression with R and ODBC

Recently I've been doing a lot of work with predictive models using logistic regression.  Logistic regression is great for determing probable outcomes of a independent binary target variable.  R is a great tool for accomplishing this task.  Often times I will use the base function glm to develop a model.  Yet there are times, due to the hardware or software memory restrictions, that the usual glm function is not enough get the job done.

A great alternative to performing usual logistic regression analyses on big data is using the biglm package.  Biglm performs the same regression optimization but processes the data in "chunks" at a time.  This allows R to only perform calculations on smaller data sets without the need for large memory allocations to the computer.  Biglm also has an interesting option that it not only can perform calculations on imported dataframes and text files but also database connectivity.  This is where the helpful package RODBC comes in to the aid.

I have be looking all over the R support lists and blogs in hopes of finding a good tutorial using biglm and RODBC.  I was not successful yet I was able to find out how to perform this myself.

The first step is to establish an ODBC source to a database.  In this example I am using a Windows OS environment and connecting to a MS SQL Server.  An odbc source must first be setup on the computer.  This is usually done in the Windows Control Panel.  Once that is done then RODBC can be used to establish a connection.  My example is an odbc data source name called "sqlserver".

myconn <- odbcConnect(sqlserver)

Now an ODBC connection object is established.  Queries can now be submitted to the SQL Server via the sqlQuery function which is what we will use as the data source.  SQL scripts can be the typical select statements.

sqlqry <- "select myvars, targetvar from mytable"

Next is to use the bigglm function to perform the logistic regression.

fit <- bigglm(targetvar ~ myvars, data=sqlQuery(myconn, sqlqry), family=binomial(), chunksize=100, maxit=10)

The data is being pulled from the SQL Server via the sqlQuery function from the RODBC package.  The bigglm will recognize the sqlQuery data as a dataframe.  The chunksize specifies the number of lines to process at any time.  The maxit value specifies the maximum number of Fisher scoring iterations.  

This big data method is not difficult once the SQL Server connection is setup.  You will notice that R will not have the memory limitations while performing logistic regressions.  For more information on regression modeling I recommend getting Applied logistic regression (Wiley Series in probability and statistics).  This book has been an essential for statistics and applied regression practitioners.  Other good resources are Logistic Regression Models (Chapman & Hall/CRC Texts in Statistical Science) and Modern Regression Techniques Using R: A Practical Guide.


Anonymous said...

I've wondered about this restriction in the doc, "Factors are permitted, but the levels of the factor must be the same across all data chunks (empty factor levels are ok)."

How can you guarantee that a given chunk size will see all levels of each factor, or does the query first execute a "distinct" on each factor to allow empty levels in any chunk?

Larry said...

That's a good question and does require some thought on how the data is prepared. I'm wondering if there are a lot of levels in the data that some pre-processing might be done to specify factors for each categorical variable. This is a good point you bring up.

Will Dwinnell said...

I'm curious as to how large a data set you'd consider running through logistic regression in R, and how large, through Biglm?

Larry said...

Will, its really hard to say. I can see very large data sets being applied with RODBC and Biglm. I know a data set I'm considering is 300,000+ observations with 500+ variables. I could suspect this method can go beyond that size. Of course it could be computationally expensive and take a while to run.

Louis Rossouw said...

The way you've written it the data will still be read into R in full and not in chunks. This still saves memory in the calculation of the glm but does not solve the problem if your data is larger than the available memory.

To fix this you need to pass the odbc connection object as data and specify the tablename="SQL tablename". In this process I have however discovered a bug in that the bigglm code uses sql syntax not compatable with MS SQL.

Larry D'Agostino said...

Thanks Louis. That is great feedback. I'll be sure to play around with that to optimize the script.

Briana Abrahms said...

Do you know how/if it possible to do a CONDITIONAL logistic regression with biglm, which uses matched case-controls such as with the function clogit in the Survival package? My data is set up for clogit in that I have all the 0s corresponding to a 1 identified as different strata. Would appreciate any thoughts on this!