Tuesday, June 29, 2010

Kaggle hosting INFORMS 2010 Data Mining Contest

Kaggle is hosting the 2010 INFORMS Data Mining Contest.  The goal of this year's contest is to predict intra-day stock price movements.  All data and submission guidelines are provided on the Kaggle website.  Submitted entries are scored immediately using an AUC (area under the ROC curve) calculation.  The entry with the leading AUC score at the end of the contest will be honored at the annual INFORMS meeting, which is in Austin, Texas (Nov. 7-10).
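For anyone who has not worked with AUC before, here is a minimal sketch in Python of how such a score can be computed from a list of true labels and predicted scores. It uses the rank-sum (Mann-Whitney) identity and is only an illustration; I am not assuming this is how Kaggle computes it internally. The function name auc_score and the sample numbers are made up for the example.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: iterable of 0/1 true outcomes
    scores: iterable of predicted scores (higher means more likely 1)
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, averaging ranks within groups of tied scores.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    # Sum of ranks of the positive cases, shifted and scaled to [0, 1].
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    # Hypothetical toy data, not contest data.
    print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A score of 1.0 means every positive case is ranked above every negative case, while 0.5 is what random guessing would give on average.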

There is already a lot of good discussion of modeling techniques.  Mark started off with a question on OR-Exchange about modeling methods for the INFORMS contest.  Since the target is a binary categorical variable, his preferred method was logistic regression.  Mark shared example R code as collaborative input to the contest.  I followed suit and submitted an IEORTools entry using the same logistic regression approach.  I also did some variable analysis using the rpart package in R to develop a decision tree.  After removing some variables that were not significant, I was able to get on the leaderboard with Mark.  The pictured leaderboard is as of June 28.
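To give a feel for the logistic regression approach, here is a small sketch in Python that fits a model to a binary target with plain gradient descent. This is not Mark's code or my contest entry (both of those were in R); the function names, learning rate, and toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log loss.

    X: list of feature rows, y: list of 0/1 targets.
    lr and epochs are illustrative defaults, not tuned values.
    """
    n = len(X)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this row: p(y=1 | x) minus the true label.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Predicted probability of the positive class for each row."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


if __name__ == "__main__":
    # Hypothetical one-variable toy data, not contest data.
    X = [[0.0], [1.0], [2.0], [3.0]]
    y = [0, 0, 1, 1]
    w, b = fit_logistic(X, y)
    print(predict_proba(X, w, b))
```

The predicted probabilities from a model like this can be submitted directly as scores, since AUC only depends on how the cases are ranked.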

There is also some good discussion on the Kaggle contest forum.  One entrant posted suggestions for possible variables to use in a logistic regression model, which is very helpful.

I really like to see this collaborative approach to modeling.  This was one of the qualities I really enjoyed in the Netflix Prize.  I hope Kaggle and INFORMS continue to provide these fun and thought-provoking contests.
