Saturday, February 19, 2011

IBM has a Natural Language Purpose

I wanted to write a blog post about the advancements of Natural Language Processing in light of the performance of IBM's Watson on the Jeopardy challenge last week.  Natural Language Processing is the science of transforming and interpreting human spoken and written language by artificial means.  Generally this type of study has been limited to academic research due to the high computing power demands.  Now there are even open source software implementations, including many R Natural Language Processing packages.  There is a lot to write about the new advances in NLP.

Instead I came across an interesting editorial on the cheap publicity stunt that is IBM's Watson.  At first I thought the article was a comedy that would make fun of Watson's errors on Jeopardy.  Then I realized the author, Colby Cosh, is not jesting at all.  This should not be news to me.  The field of Operations Research, which was definitely used to help develop Watson, is widely misunderstood.  Cosh has a hard time understanding why IBM would want to develop such a stunt to compete against humans.  Cosh seems to think that the only gain is for IBM's shareholders.  I can assure you that if IBM wanted to make money on this venture they would have created a computer that would compete on American Idol.  Jeopardy is no ratings juggernaut in the US.

So what purpose would IBM have for competing on Jeopardy?  Perhaps the idea of "competition" is misleading.  In my eyes I was not watching to see if a computer can beat humans in a battle of wits.  I was watching to see if a device could interpret, process, and return meaningful information on the same level as human interpretation.  Natural Language Processing is like code breaking.  Similarly, mathematics, physics, and natural science are like codes to mathematicians, scientists, and engineers.  It is the process of trying to decipher and interpret our natural surroundings.  Language is no different.  I can see how easy it is for Cosh to think that the sole idea of the competition is to beat humans.  The purpose was simply to decipher the natural language code.  With a better understanding of natural language we can then understand our surroundings a little better.

So why the hype with a computer?
"So why, one might ask, are we still throwing computer power at such tightly delimited tasks,..."
The answer can be found already in the field of Operations Research and Management Science.  Perhaps Cosh has purchased a plane ticket in the past few years.  He might have noticed that air transportation has become very affordable due to competitive pricing.  A lot of that is due to optimization and revenue management algorithms in the airline industry.  Perhaps he noticed the improvement in quality, service, and price of privatized parcel postage.  The science of better decision making and transportation algorithms has greatly improved supply chain and delivery efficiency.  The list can go on and on.  Artificial Intelligence is probably a poor way of describing computer optimization and machine learning science.  Artificial Intelligence is not going to replace human intelligence but only help improve the human based decisions that we make every day.  IBM has already stated that they wish to improve the medical field with Watson.  Medical diagnosis requires vast amounts of information and Watson can help decipher medical journals, texts, and resources within seconds.  Applications of Watson could be used in third world countries where medical resources are scarce.

I will be looking forward to IBM's advancement with Natural Language Processing.  This offers a new venture into better decision sciences.  Perhaps "smacking into the limits" of artificial intelligence will create a better life for those that use human intelligence every day.

Wednesday, February 16, 2011

Question and Answer sites for Analytics and Operations Research

This post is inspired by a similar post on Jeremy Anglin's blog about statistics question and answer sites.  I thought it would be a good idea to list some of the Operations Research and Analytics focused question and answer sites.  Some of these

Operations Research
This site is a stackexchange Q&A site started by Michael Trick.  It is a crowdsourced question and answer resource for anything Operations Research related.  I believe it is the best one available for Operations Research.

Numerical Optimization Forum
I find it unfortunate that there are so few forums based on Operations Research.  This is one of the few and it is a good one.  It is moderated by IEOR Tools contributor Dmitrey.

I'm purposely leaving out the sci.ops-research Usenet group because I believe it has fallen into disarray with spam content.

Cross Validated
My favorite stack-exchange site dedicated to statistics.

Math Overflow  

StackOverflow - R tag    
StackOverflow - SQL tag

Mailing Lists
Mailing lists do not get as much attention as they once did, maybe because there are so many other options on the internet for getting information.  I still think they are a valuable resource and a good online community.

R Help Mailing List
GLPK Mailing List
COIN-OR Mailing List(s)

Beta StackExchange Sites
These sites might be of interest to the Operations Research community.  They are not live yet but are looking to generate a following. (Interesting.  Do they know about OR-Exchange?)

I would love to see more examples that I can include in this list.


I forgot to add O'Reilly's Q&A site with the R tag.

Friday, February 11, 2011

Science of Matchmaking

The science of matchmaking has seen serious growth in the last few years.  What exactly is so scientific about matchmaking anyway?  The goal of any commercial enterprise (and some public organizations) is to match products or services to the demand of consumers.  The idea of matching consumers with products and services is not new.  Matchmaking is essentially the business art of Marketing.  It is the science behind the matchmaking that has seen the most advancement and improvement in recent times.  Generally speaking, computing power has made the difference for the technological leap forward.  Millions of observations and data points can be sifted and combed with great ease compared to even just a decade ago.  There is a scientific magic fiddling to the matchmaking phenomenon (sorry, bad Fiddler On The Roof pun).

Mathematics of Matchmaking

I'm not sure I can cover all of the math behind the science of matchmaking.  I thought it best to describe an example with the company Netflix.  Netflix wants to make the decision process of selecting movies easier for its customers.  Netflix developed an algorithm to match customers' interests in movies.  In fact they even decided to farm out an improvement to the algorithm in a worldwide contest.  So how does the Netflix algorithm work?  There is a lot of math behind the algorithm but it essentially comes down to finding common features in the customer and movie data.  The customers give Netflix a clue to the features they want by rating the movies they have watched.  These ratings then become the dependent variables in the algorithm formulation.  Then the algorithms churn out likely matches based on common feature sets.
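To make the idea of matching on common features a little more concrete, here is a minimal sketch in Python.  It is not Netflix's algorithm.  It is a simple nearest-neighbor recommender that assumes customers who rated the same movies alike share tastes.  The customer names, movie titles, ratings, and the function names cosine_similarity and recommend are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two customers, each given as a {movie: rating} dict."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(target, ratings, top_n=1):
    """Rank movies the target has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine_similarity(ratings[target], other_ratings)
        if sim <= 0:
            continue
        for movie, rating in other_ratings.items():
            if movie in ratings[target]:
                continue  # only suggest movies the target has not rated
            scores[movie] = scores.get(movie, 0.0) + sim * rating
            weights[movie] = weights.get(movie, 0.0) + sim
    predicted = {m: scores[m] / weights[m] for m in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]

# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "ann": {"Alien": 5, "Blade Runner": 4, "Amelie": 1},
    "bob": {"Alien": 5, "Blade Runner": 5, "The Matrix": 5, "Notting Hill": 1},
    "cat": {"Amelie": 5, "Notting Hill": 5, "Alien": 1, "The Matrix": 2},
}

print(recommend("ann", ratings))  # ['The Matrix']
```

Ann's ratings look far more like Bob's than Cat's, so Bob's opinion carries most of the weight and his favorite unseen movie comes out on top.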

Perhaps one of the best writings on this subject was given by Simon Funk on his blog about his Netflix Contest adventures.  Simon thought a creative way to find features would be to use the matrix transformation process of Singular Value Decomposition.  Traditionally Singular Value Decomposition was used in the microelectronics industry to improve digital signal processing.  Simon wrote up an easy solution for matchmaking movie features with the SVD method which spurred a wave of enthusiasm among the Netflix Contest entrants.

Finding feature sets is not exclusively in the realm of Linear Algebra.  There are also methods of clustering, regression, support vector machines, neural networks, Bayesian networks, and decision trees, just to name a few.  The science of matchmaking is closely related to artificial intelligence and is commonly referred to as machine learning.  Machine learning is using algorithms and mathematical methods to evolve and generate behaviors from data in order to solve a problem.

Processing the Matchmaking Data

The science of matchmaking would not be complete without the data.  The advent of the internet has opened a lot of new enterprises that make use of millions of data observations.  These internet companies have a lot of data to process in huge server arrays that would make even the ENIAC envious.  So how do these companies process all of this matchmaking data with their matchmaking algorithms?  The basic answer is to break it down into manageable chunks.  Perhaps there is no greater example than Google and their MapReduce methods.  MapReduce is a software framework that takes a large computing job and breaks it down across a distributed network into more manageable pieces.  The first step in the MapReduce process is to Map the data.  The Map process is to organize and distribute the data to computing nodes, usually a huge cluster.  The Reduce process is to apply the algorithm or learning process at a node in the network and determine an answer for the data it is given.  This essentially gives it a local optimum.  This process is iterated until a globally learned optimum is achieved.  This is a very cut and dried description but you get the idea.
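A toy version of the Map and Reduce steps can be sketched in a few lines of Python.  This runs on a single machine and only mimics the usual key and value formulation: the Map step emits (key, value) pairs from each chunk, the pairs are grouped by key, and the Reduce step boils each group down to one answer.  The chunked rating data and the function names are hypothetical, and nothing here uses Google's or Hadoop's actual API.

```python
from collections import defaultdict

def map_ratings(chunk):
    """Map step: emit a (movie, rating) pair for every record in one chunk."""
    return [(movie, rating) for _user, movie, rating in chunk]

def shuffle(mapped_chunks):
    """Group every emitted value under its key, across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_average(movie, ratings):
    """Reduce step: collapse all ratings for one movie into its average."""
    return movie, sum(ratings) / len(ratings)

def run_job(chunks):
    # On a real cluster each chunk would be mapped on a different node.
    mapped = [map_ratings(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_average(movie, values) for movie, values in grouped.items())

# Hypothetical (customer, movie, rating) records split into two chunks.
chunks = [
    [("ann", "Alien", 5), ("bob", "Alien", 3)],
    [("cat", "Brazil", 4), ("ann", "Brazil", 2), ("bob", "Alien", 4)],
]

print(run_job(chunks))  # {'Alien': 4.0, 'Brazil': 3.0}
```

The point of the split is that the Map calls do not depend on each other, so they can be farmed out to as many machines as there are chunks.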

The MapReduce software framework is proprietary to Google.  That has not stopped software enthusiasts.  An open source MapReduce implementation called Hadoop was created and is growing a strong, user supported community.

So what can the science of matchmaking be used for?  Really anything the heart desires (okay, again, that was bad).  Amazon uses recommendation algorithms for its books and products.  Online dating sites (how appropriate) use matchmaking methods for matching interested daters.  Search engines like Google use ranking algorithms, such as PageRank, to match search keywords with websites.  As you can tell these types of enterprises are doing very well thanks to the science of matchmaking.

This article is part of the INFORMS Online blog challenge.  February's blog challenge is Operations Research and Love.

Wednesday, February 9, 2011

Data Mining Books List

I came across a great list of Data Mining Books while perusing around the internet.  The list is maintained by Kurt Thearling who is Director and Chief Scientist in various organizations helping to develop their Analytics and Engineering groups.  Kurt has written some white papers on the subject of Data Mining and has also been featured on NPR.  Kurt's NPR piece was about data mining and privacy which is obviously a big subject in our Facebook society today.

I believe this is probably one of the most comprehensive lists of Data Mining books available.  If you are interested in obtaining one of these books please be sure to peruse the new IEOR Tools Online Store Data Mining section.  There you can find books and references on Data Mining with varying levels from introductory to advanced applications.