Thursday, November 10, 2011

My learning as a Data Scientist

So apparently the new en-vogue title is Data Scientist.  I can now add that to my already expanding list of titles.  In the past I've been known as an Engineer, Operations Analyst, Production Control Specialist, and an Analytics Analyst.  Now I'm considered a Data Scientist.  It's all the same to me.  My training and expertise have allowed me to solve many problems within organizations.  The title doesn't matter.  There are opportunities for people with my skill set.

A recent blog post by Kontagent Kaleidoscope, "Big Data is Useless without Science", got me thinking about my role as a self-proclaimed Data Scientist.  The article points out a need for the science of better decision making.  Organizations are looking for people to help them turn their data mines into information gold.  I've definitely learned a lot over the years as a Data Scientist and I thought I would list some of those lessons.

1.  Organizations Don't Know What a Data Scientist Can Do

The idea here is marketing your own talents.  The Data Scientist needs to put their methods and work out there for the organization to see and touch.  This means working with the peers and management in the organization.  The Data Scientist needs to be able to eloquently relate methods, problems, challenges and how they can be solved.  Important skills here are personal marketing and communication.  I know this goes against the grain of many numbers geeks like me.

2.  Problems Don't Solve Themselves

Opportunities for solving real problems in an organization are always around.  The trick is being in the right place at the right time to be able to solve those problems.  Organizations have hoarded a lot of data and many times they don't even remember why.  The Data Scientist needs to turn into a Data Detective.  Explore all aspects of the organization.  Interview different departments, see how they tick, and ask questions like "What keeps you up at night about your job?"  I was often surprised how far a simple solution would go toward helping someone else out.  This develops true collaboration and leads to bigger problems to solve.

3.  Always Continue to Learn New Things

The world is constantly evolving and there are always new tools, tricks, methods, algorithms, software and mechanisms.  The Data Scientist needs to be able to adapt to new technologies.  I've found it's best to stay current with what's new in order to stay sharp and meet new demands.  The internet can be your friend.  Even keeping up with a favorite list of blogs can help with staying current.  Times change and so do organizations' needs.  Perhaps this is just me but I love learning new things as it creates a fun diversion and improves my skill set.



Tuesday, October 11, 2011

Top 50 Statistics blogs of 2011

TheBestColleges.org published their list of the top 50 statistics blogs.  This is a really good resource list of statistical analysis and news.

Monday, October 3, 2011

Data mining the Federal Reserve

The Federal Reserve now makes its data available for programmatic retrieval.  The St. Louis Fed Web Services allow programmers and data scientists to retrieve key economic data from its libraries.  I have not had a chance to peruse the site at all but this can be a really interesting source of data.  The age of Open Data is really upon us.  This can lead to some really interesting research for professionals and amateur scientists.
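
For anyone who wants to try it, here is a minimal sketch of what retrieval might look like in Python.  It assumes the FRED "series/observations" endpoint and its series_id, api_key and file_type parameters as I understand them from the public documentation, that missing values come back as ".", and that you have registered for your own API key.  The function names are my own invention, so treat this as a starting point and check the current documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the St. Louis Fed (FRED) web service; verify
# against the current documentation.
FRED_OBSERVATIONS_URL = "https://api.stlouisfed.org/fred/series/observations"


def build_observations_url(series_id, api_key):
    """Build the request URL for one data series (hypothetical helper)."""
    query = urlencode(
        {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    )
    return f"{FRED_OBSERVATIONS_URL}?{query}"


def parse_observations(payload):
    """Turn a JSON response body into a list of (date, value) pairs.

    Assumes each observation has "date" and "value" string fields and
    that a value of "." marks a missing data point.
    """
    data = json.loads(payload)
    rows = []
    for obs in data.get("observations", []):
        raw = obs["value"]
        rows.append((obs["date"], None if raw == "." else float(raw)))
    return rows


def fetch_observations(series_id, api_key):
    """Download and parse one series; needs network access and a valid key."""
    with urlopen(build_observations_url(series_id, api_key)) as response:
        return parse_observations(response.read().decode("utf-8"))
```

Calling fetch_observations with a series ID and your own key should hand back a list you can load straight into your analysis tool of choice.  Keeping the URL building and the parsing separate from the download makes both easy to check without touching the network.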

Monday, September 26, 2011

Machine Learning for everyone

Well, maybe mostly everyone.  Have you been interested in gaining knowledge in the latest craze of artificial intelligence and computing?  Then look no further than Stanford's Machine Learning course, which is now open for enrollment to everyone!  Andrew Ng is back to bring knowledge about Machine Learning to the masses.

Per Stanford's website, Machine Learning is data mining and statistical pattern recognition.  Mostly it is applying mathematical and statistical methods to draw out information and behaviors from data sources.  So do you want to invent the next Netflix, Amazon or Google?  This is the course for you.

If you do not want to enroll in the Machine Learning class you could always watch some of the older lectures online.  Andrew Ng provides plenty of information from past lectures along with student-contributed projects.  The CS 229 website is worth a look for a bunch of Machine Learning related resources.

Friday, September 23, 2011

Data Driven Success in Professional Baseball

An interesting article from Data Center Knowledge covers the presentation Paul DePodesta gave at the Strata Summit.  Paul DePodesta is known for bringing mathematical and analytical know-how to Billy Beane and the Oakland Athletics major league baseball team.  His story was recounted by Michael Lewis in the book "Moneyball" and is being portrayed in the film of the same name, which opens this weekend.

I really liked this quote from Paul in the article.

"We didn’t solve baseball. But we reduced the inefficiency of our decision making."

Is that not the sort of thing that an analytical professional or an Operations Researcher ultimately tries to do?  Operations Research is not the art of creating anything new.  It is the art of making existing things better.  All decision making is inefficient to some degree.  Even the right decision can be inefficient on some level.  Decisions are full of balancing acts between constraints and feasibility.

This also shows that no industry or organization is exempt from the need for efficient decision making.  Even baseball can use a dose of improved decision analysis, whether it is scheduling the league or determining the best pitcher for the money.  Sports have definitely come into their own with decision analytics.  I'm eager to watch Paul's career and see whether analytics takes it to the next level.

Thursday, September 15, 2011

OpenOpt Suite 0.36

New release of the free BSD-licensed software OpenOpt Suite is out:

OpenOpt:

* Now solver interalg can handle all types of constraints and integration problems

* Some minor improvements and code cleanup

FuncDesigner:

* Interval analysis now can involve min, max and 1-d monotone splines R -> R of 1st and 3rd order

* Some bugfixes and improvements

SpaceFuncs:

* Some minor changes

DerApproximator:

* Some improvements for obtaining derivatives in points from R^n where left or right derivative for a variable is absent, especially for stencil > 1

See http://openopt.org for more details.


Wednesday, August 31, 2011

Physicist cuts airplane boarding time in half

I have always been fascinated with the airplane boarding problem.  I wish I was in the airline industry because I would love to tackle this problem.  I used to travel a lot for my job and I would marvel at how inefficient it was to board an airliner.  My first inclination is to redesign the plane (and the airport jetway) to include exits at the middle and rear of the plane to go along with the forward exit.  Yet I never put my ideas to paper and tried to calculate efficiency gains.  There have been a lot of attempts to find the optimal boarding arrangement.  Would you believe that random boarding, as at Southwest Airlines, is a better boarding procedure than the current row assignment method?

Yet a physicist from Fermilab, Jason Steffen, did have some interesting ideas to improve the existing airplane boarding procedures.  By using Monte Carlo simulations to measure efficiency and test his ideas he was able to cut airplane boarding time by as much as half.  From the article, his method was to board sections of window seats first but in alternating rows, so passengers would not interfere with each other.
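
To get a feel for how a simulation like this can be set up, here is a rough sketch in Python.  This is not Dr. Steffen's model; it is a toy of my own with loud assumptions: a single aisle with one cell per row, passengers who advance one row per tick, and a fixed stow_time during which a passenger blocks the aisle at their own row.  It ignores seat shuffling (window versus aisle) and overhead bin capacity entirely, and all function names are hypothetical.

```python
import random


def simulate_boarding(order, n_rows, stow_time=3):
    """Return the number of ticks until every passenger is seated.

    order     -- target rows (1 = front of plane) in boarding sequence
    n_rows    -- number of rows; the aisle has one cell per row
    stow_time -- ticks a passenger blocks the aisle at their own row
    """
    aisle = [None] * n_rows          # each cell: [target_index, stow_left] or None
    queue = list(reversed(order))    # next passenger to board sits at the end
    seated, ticks = 0, 0
    while seated < len(order):
        ticks += 1
        # Walk from the back of the plane toward the door so a cell vacated
        # this tick can be entered by the passenger right behind it.
        for pos in range(n_rows - 1, -1, -1):
            passenger = aisle[pos]
            if passenger is None:
                continue
            target, stow_left = passenger
            if pos == target:
                if stow_left <= 1:
                    aisle[pos] = None            # done stowing, sits down
                    seated += 1
                else:
                    passenger[1] = stow_left - 1
            elif aisle[pos + 1] is None:
                aisle[pos + 1], aisle[pos] = passenger, None
        # One new passenger steps through the door if the first cell is free.
        if queue and aisle[0] is None:
            aisle[0] = [queue.pop() - 1, stow_time]
    return ticks


def back_to_front(n_rows, seats_per_row):
    """Everyone in the last row boards first, then the next row forward."""
    return [row for row in range(n_rows, 0, -1) for _ in range(seats_per_row)]


def alternating_rows(n_rows, seats_per_row):
    """One passenger per row per pass: every other row from the back,
    then the skipped rows, repeated once for each seat in a row."""
    one_pass = list(range(n_rows, 0, -2)) + list(range(n_rows - 1, 0, -2))
    return one_pass * seats_per_row


def mean_random_boarding(n_rows, seats_per_row, trials=200, seed=0):
    """Monte Carlo estimate: average boarding time over shuffled orders."""
    rng = random.Random(seed)
    order = back_to_front(n_rows, seats_per_row)
    total = 0
    for _ in range(trials):
        rng.shuffle(order)
        total += simulate_boarding(order, n_rows)
    return total / trials
```

Running simulate_boarding on back_to_front(30, 6) and alternating_rows(30, 6), and comparing both against mean_random_boarding(30, 6), gives a quick way to play with boarding orders.  The numbers only mean something inside this toy model, so I would not read too much into them without adding the pieces left out.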

This is a very clever idea.  Yet I found one flaw that may not have been accounted for in his study.  I've noticed that overhead space is at a premium for passengers, especially for business travelers.  Business travelers often bring two carry-on bags.  These bags tend to fill up the overhead bins rather quickly.  When the overhead bins fill up, passengers have to search up and down the aisle for available space for their bags.  This creates a bottleneck and queues develop for the other boarding passengers.  It seems to me that Jason's study assumes that all overhead bins would be available at the time of boarding.  If alternating rows are in fact used in his model, then overhead bins might fill to capacity before later passengers board and create more bottlenecks.  It's just one theory that would be worth investigating before Jason's procedures are implemented.

I applaud Dr. Steffen's studies and findings on the airplane boarding problem.  It is a fascinating problem, as most of us have encountered airplane boarding from time to time.  For more information on his methods you can read about Jason's work on airplane boarding on his website.