Coca-Cola has been using an algorithm to increase revenue by perfecting the taste and the supply chain of one of its up-and-coming segments, 100% natural orange juice. I happen to be a huge fan of orange juice and of Coca-Cola’s marketing campaigns, so naturally the recent article in BusinessWeek stood out to me. I commend the entire company, and especially Coke’s vice president of business acceleration, Doug Bippert, for their dedication to data-driven, quantitative solutions. In fact, the more I researched it the more impressive it was, to the point that it encouraged me to look into the company’s stock. (Disclaimer: I am long KO.)

# About the Algorithm “Black Book”

Black Book, the algorithm they describe, was implemented by industry expert Bob Cross at Revenue Analytics. According to the article, it uses “up to 1 quintillion decision variables”… that is a lot of factors! Some of those include “external factors such as weather patterns, expected crop yields, and cost pressures.”

The article suggested that the model’s response “includes detailed data about the myriad flavors – more than 600 in all – that make up an orange, and customer preferences.” Additional output includes “how to blend batches”, supply chain predictions as far out as “15 months”, and the optimal time to pick the oranges. Finally, not only can they model all of this accurately, but as they state, “we can quickly replan the business in 5 or 10 minutes just because we’ve mathematically modeled it.”

# How to build your own Black Book Algorithm

Here I propose the steps for building your own predictive model, regardless of whether you choose to use a consultant or implement it yourself.

- **Define response(s)** and what you want to model.
- **Collect data** on all possible factors that might influence your responses, along with the associated responses as defined previously.
- Use a **variable reduction** technique to reduce your factors (i.e., take them down from 1 quintillion to something a little more realistic, say 10,000 or fewer).
- **Create a model**. This takes a little experience and data knowledge. It could be as simple as a linear model, though more likely they used a form of boosted forests.
- The model will tell you the **relative influence of each factor**. Here you’ll consider reducing the significant factors even more, probably to the top 100 for each response.
- **Predict the future**. Apply the model to future or anticipated values for each of the significant factors. Since you are only looking at the most influential factors, each run of this part should easily take less than a few minutes.
- **Optimize future response**. Now you can also run simulations to experiment and optimize the future output. Depending on the complexity of the model and the ability to converge on an optimal response, this last part could take longer than 20 minutes.
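
To make the steps above concrete, here is a minimal sketch in plain Python. It is not Coca-Cola's or Revenue Analytics' method, which the article does not detail. Everything in it is an assumption for illustration: the data is synthetic, correlation screening stands in for variable reduction, an ordinary least-squares linear model stands in for boosted forests, and a simple random search stands in for the optimization step. All function names are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Define responses / collect data: synthetic stand-in, 200 observations of
# 20 candidate factors. By construction only factors 0 and 5 matter.
N_ROWS, N_FACTORS = 200, 20
rows = [[rng.uniform(-1, 1) for _ in range(N_FACTORS)] for _ in range(N_ROWS)]
y = [3.0 * r[0] - 2.0 * r[5] + rng.gauss(0, 0.1) for r in rows]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def screen_factors(rows, y, keep):
    """Variable reduction: keep the columns most correlated with the response."""
    n_cols = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], y)) for j in range(n_cols)]
    return sorted(range(n_cols), key=lambda j: scores[j], reverse=True)[:keep]

def fit_linear(rows, y, cols):
    """Create a model: least squares via the normal equations."""
    A = [[1.0] + [r[j] for j in cols] for r in rows]  # intercept + kept factors
    k = len(A[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(a[i] * a[j] for a in A) for j in range(k)]
         + [sum(a[i] * t for a, t in zip(A, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]  # [intercept, coefficients...]

def relative_influence(rows, cols, coefs):
    """Relative influence: |coefficient| x factor spread, normalised to sum to 1."""
    raw = []
    for c, j in zip(coefs[1:], cols):
        xs = [r[j] for r in rows]
        m = sum(xs) / len(xs)
        sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
        raw.append(abs(c) * sd)
    total = sum(raw)
    return {j: v / total for j, v in zip(cols, raw)}

def predict(coefs, cols, row):
    """Apply the fitted model to one row of factor values."""
    return coefs[0] + sum(c * row[j] for c, j in zip(coefs[1:], cols))

kept = screen_factors(rows, y, keep=5)
coefs = fit_linear(rows, y, kept)
influence = relative_influence(rows, kept, coefs)

# Predict the future: apply the model to an anticipated scenario (made up here).
scenario = [0.0] * N_FACTORS
scenario[0], scenario[5] = 0.5, -0.5
forecast = predict(coefs, kept, scenario)

# Optimize future response: random search over the kept factors' observed range.
best_pred, best_row = float("-inf"), None
for _ in range(500):
    cand = [0.0] * N_FACTORS
    for j in kept:
        cand[j] = rng.uniform(-1, 1)
    p = predict(coefs, kept, cand)
    if p > best_pred:
        best_pred, best_row = p, cand

print("kept factors:", kept)
print("forecast:", round(forecast, 2))
print("best simulated response:", round(best_pred, 2))
```

On real data you would likely swap the linear fit for a tree-based ensemble and the random search for a proper optimizer, but the flow stays the same: screen, fit, rank, predict, optimize.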

Comments appreciated. If you would like to talk more about implementation details for a project you are working on please feel free to contact me.