Best description of data science

When I asked Bruno Aziza last year how he would best describe a data scientist, his answer still sticks with me today. “Think of a data scientist more like the business analyst-plus,” he told me. Part mathematician, part business strategist, these statistical savants are able to apply their background in mathematics to help companies tame their data dragons. But these individuals aren’t just math geeks, per se.

“A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a Renaissance individual who really wants to learn and bring change to an organization.” — Anjul Bhambhri, Vice President of Big Data Products, IBM

Favorite Websites for Data Scientists

Shown are the favorite websites of all registered users on DataScienceCentral, based on self-reported results. It is interesting to note that KDnuggets continues to be a strong leader as a data aggregator for the community, even among the biased sample of DataScienceCentral members. It is also impressive to see R with such a strong presence.

Figure: Favorite websites of data scientists. Self-reported favorite websites based on over 40k registered users on DataScienceCentral.

The list of links based on popularity is as follows:

Skills of a data quant

I’m really looking forward to the upcoming session at Predictive Analytics World: Necessary Skills of the Quant Across Sectors

What makes a quant a true rockstar? What kind of soft skills, what kind of tech skills and background, and what portfolio of experience? With the organizational process behind predictive analytics – across business applications such as fraud and marketing – something of an art form, the requisite skills of key analytics staff are multidimensional and often hard to nail down. This expert panel will grab a hammer and start defining exactly what’s needed in this very particular workforce.

Recently I saw the article "Tips for hiring a data scientist", which highlighted the importance of intellectual curiosity.

Article Review: Coca Cola’s Orange Juice Algorithm

Orange Juice Algorithm

Coca Cola has been using an algorithm to increase revenue by perfecting the taste and the supply chain of one of its up-and-coming segments, 100% natural orange juice. I happen to be a huge fan of orange juice and Coca Cola’s marketing campaigns, so naturally the recent article in BusinessWeek stood out to me. I commend the entire company, and especially Coke’s vice president of business acceleration, Doug Bippert, for their dedication to using data-driven, quantitative solutions. In fact, after more research I was impressed enough to look into the company’s stock. (Disclaimer: I am long KO.)

About the Algorithm “Black Book”

Black Book, the algorithm they describe, was implemented by industry expert Bob Cross at Revenue Analytics. According to the article they use “up to 1 quintillion decision variables”… that is a lot of factors! Some of those include “external factors such as weather patterns, expected crop yields, and cost pressures.”

The article suggested that the model’s response “includes detailed data about the myriad flavors – more than 600 in all – that make up an orange, and customer preferences.” Additional output includes “how to blend batches”, supply chain predictions as far out as “15 months”, and the optimal time to pick the oranges. Finally, not only can they accurately model all of this, but, as they state, “we can quickly replan the business in 5 or 10 minutes just because we’ve mathematically modeled it.”

How to build your own Black Book Algorithm

Here I propose the steps for building your own predictive model, regardless of whether you choose to use a consultant or implement it yourself.

  1. Define response(s) and what you want to model.
  2. Collect data on all possible factors that might influence your responses, along with the observed values of the responses defined in step 1.
  3. Use a variable reduction technique to reduce your factors (i.e., take it down from 1 quintillion to something a little more realistic, say 10,000 or fewer).
  4. Create a model. This takes a little experience and data knowledge. It could be as simple as a linear model, but more likely they used a form of boosted forests.
  5. The model will tell you the relative influence of each factor. Here you’ll consider reducing the significant factors even further, probably to the top 100 for each response.
  6. Predict the future. Apply the model to future or anticipated values for each of the significant factors. Since you are only looking at the most influential factors this part should easily take less than a few minutes each run.
  7. Optimize future response. Now you can also run simulations to experiment and optimize the future output. Depending on the complexity of the model and the ability to converge on an optimal response this last part could take longer than 20 minutes.
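
To make the steps above concrete, here is a minimal sketch in Python using only the standard library. It is not Coca Cola's or Revenue Analytics' method. Everything in it is an assumption made for illustration: the data is synthetic (20 made-up factors, of which only two drive the response), the variable reduction is a simple correlation screen, and the model is the "as simple as a linear model" option fitted by gradient descent rather than boosted forests. The function names (screen_factors, fit_linear, predict) are hypothetical.

```python
import random

random.seed(42)

# Steps 1-2: define the response and collect data. Everything here is
# synthetic: 20 candidate factors, of which only factors 0 and 1 matter.
N_ROWS, N_FACTORS = 200, 20
X = [[random.gauss(0, 1) for _ in range(N_FACTORS)] for _ in range(N_ROWS)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0, 0.1) for row in X]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def screen_factors(X, y, keep):
    """Step 3: keep the `keep` factors most correlated with the response."""
    scores = [(abs(pearson([row[j] for row in X], y)), j)
              for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:keep]]


def predict(row, factors, w, b):
    """Linear prediction using only the selected factor columns."""
    return b + sum(wk * row[j] for wk, j in zip(w, factors))


def fit_linear(X, y, factors, epochs=300, lr=0.1):
    """Step 4: fit a linear model on the chosen factors by gradient descent."""
    w, b, n = [0.0] * len(factors), 0.0, len(y)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(factors), 0.0
        for row, target in zip(X, y):
            err = predict(row, factors, w, b) - target
            grad_b += err
            for k, j in enumerate(factors):
                grad_w[k] += err * row[j]
        b -= lr * grad_b / n
        w = [wk - lr * g / n for wk, g in zip(w, grad_w)]
    return w, b


screened = screen_factors(X, y, keep=5)
w, b = fit_linear(X, y, screened)

# Step 5: the factors share a scale, so |weight| serves as relative
# influence. Keep the two most influential factors and refit on them.
ranked = sorted(zip(screened, w), key=lambda p: abs(p[1]), reverse=True)
top_factors = [j for j, _ in ranked[:2]]
weights, intercept = fit_linear(X, y, top_factors)

# Step 6: predict the response for an anticipated future scenario.
future_row = [0.5] * N_FACTORS
forecast = predict(future_row, top_factors, weights, intercept)

# Step 7: optimize by simulation. Randomly try settings of the influential
# factors within an allowed range of [-1, 1] and keep the best one found.
best_prediction, best_settings = float("-inf"), None
for _ in range(2000):
    trial = [0.0] * N_FACTORS
    for j in top_factors:
        trial[j] = random.uniform(-1, 1)
    value = predict(trial, top_factors, weights, intercept)
    if value > best_prediction:
        best_prediction = value
        best_settings = {j: trial[j] for j in top_factors}

print("influential factors:", top_factors)
print("forecast:", round(forecast, 2))
print("best settings found:", best_settings)
```

On real data you would swap the correlation screen and the linear fit for whatever reduction technique and model suit the problem. The shape of the workflow (screen, fit, rank, refit, predict, simulate) stays the same.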

Comments appreciated. If you would like to talk more about implementation details for a project you are working on please feel free to contact me.