Article Review: Coca Cola’s Orange Juice Algorithm

Orange Juice Algorithm

Coca Cola has been using an algorithm to increase revenue by perfecting the taste and the supply chain of one of its up-and-coming segments: 100% natural orange juice. I happen to be a huge fan of orange juice and of Coca Cola’s marketing campaigns, so naturally the recent article in BusinessWeek stood out to me. I commend the entire company, and especially Coke’s vice president of business acceleration, Doug Bippert, for their dedication to data-driven, quantitative solutions. In fact, upon more research it was so impressive that it encouraged me to look into the company’s stock. (Disclaimer: I am long KO.)

About the Algorithm “Black Book”

Black Book, the algorithm described in the article, was implemented by industry expert Bob Cross of Revenue Analytics. According to the article, it uses “up to 1 quintillion decision variables”… that is a lot of factors! Some of those include “external factors such as weather patterns, expected crop yields, and cost pressures.”

The article suggested that the model’s response “includes detailed data about the myriad flavors – more than 600 in all – that make up an orange, and customer preferences.” Additional output includes “how to blend batches”, supply chain predictions as far out as “15 months”, and the optimal time to pick the oranges. Finally, not only can they model all of this accurately, but, as they state, “we can quickly replan the business in 5 or 10 minutes just because we’ve mathematically modeled it.”

How to build your own Black Book Algorithm

Here I propose the steps for building your own predictive model, regardless of whether you use a consultant or implement it yourself; a minimal R sketch follows the list.

  1. Define the response(s) you want to model.
  2. Collect data on all possible factors that might influence your responses, along with the corresponding response values defined in step 1.
  3. Use a variable reduction technique to cut down your factors (i.e., take it from 1 quintillion to something a little more realistic, say 10,000 or fewer).
  4. Create a model. This takes a little experience and knowledge of the data; it could be as simple as a linear model, though more likely they used a form of boosted forests.
  5. The model will tell you the relative influence of each factor. Here you’ll consider reducing the significant factors even further, probably to the top 100 for each response.
  6. Predict the future. Apply the model to future or anticipated values of each significant factor. Since you are only looking at the most influential factors, each run of this step should easily take less than a few minutes.
  7. Optimize the future response. Now you can also run simulations to experiment with and optimize the future output. Depending on the complexity of the model and its ability to converge on an optimal response, this last part could take longer than 20 minutes.
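
As a minimal sketch of steps 3 through 6, here is what this could look like in R using the caret and gbm packages. The data frame df, its response column, and future_df are placeholders of my own invention; this is an illustration of the general recipe, not anything resembling Coke’s actual implementation.

    # Minimal sketch of steps 3 through 6 (package choices and the data
    # frames df and future_df are illustrative placeholders).
    library(caret)  # helpers for variable reduction
    library(gbm)    # boosted trees

    # Step 3: variable reduction -- drop near-zero-variance and highly
    # correlated factors (assumes the factor columns are numeric)
    predictors <- df[, setdiff(names(df), "response")]
    nzv <- nearZeroVar(predictors)
    if (length(nzv) > 0) predictors <- predictors[, -nzv]
    high_cor <- findCorrelation(cor(predictors), cutoff = 0.9)
    if (length(high_cor) > 0) predictors <- predictors[, -high_cor]

    # Step 4: fit a boosted-tree model of the response vs. the remaining factors
    model_df <- cbind(response = df$response, predictors)
    fit <- gbm(response ~ ., data = model_df, distribution = "gaussian",
               n.trees = 500, interaction.depth = 3, shrinkage = 0.05)

    # Step 5: relative influence of each factor; keep only the top 100
    influence <- summary(fit, n.trees = 500, plotit = FALSE)
    top_factors <- head(as.character(influence$var), 100)

    # Step 6: predict the response for anticipated future factor values
    predicted <- predict(fit, newdata = future_df, n.trees = 500)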

Comments are appreciated. If you would like to talk more about implementation details for a project you are working on, please feel free to contact me.

Demand the (Right) Right Data

Mark Twain once noted, “The difference between the right word and the ‘almost right word’ is like the difference between lightning and a lightning bug.” So too with data.

Know the data, where it comes from and what it means.
~ http://feeds.harvardbusiness.org/~r/harvardbusiness/~3/EMdMiZoflMI/demand_the_right_right_data.html

Analytics-driven management

“Interestingly, the top performers also turn out to be the organizations most focused on improving their use of analytics and data, despite the fact that they’re already ahead of the adoption curve,” said Michael S. Hopkins, editor-in-chief of MIT Sloan Management Review. “We discovered that the more managers know about analytics-driven management — and see how it can create value — the more they know that they want to know more.”

Dealing with the Data Deluge – MIT News Office
http://web.mit.edu/newsoffice/2010/smr-analytics.html

Visualizing Multivariate Data in R

When doing exploratory data analysis it is imperative to properly visualize large and often complex multivariate datasets. Regardless of the size, it is usually the multivariate part that makes the data hard to interpret.

http://addictedtor.free.fr/graphiques/ has a collection of hundreds of graphs and their source code. Once you get your data into the right format, this makes it quick and easy to visualize.

Here are some plots that I think would be very useful:

That turned out to be a long list, but there are some good ones there, and I can think of a useful application for each of them right off the top of my head. The next step will be implementing them as time permits.
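
As a quick illustration of the kind of multivariate views I have in mind, here is a small sketch using a built-in data set (my own example, not code from the gallery):

    # A quick multivariate look at the built-in iris data: a scatterplot
    # matrix plus parallel coordinates, colored by species.
    pairs(iris[, 1:4], col = as.integer(iris$Species),
          main = "Scatterplot matrix of the iris measurements")

    library(MASS)  # for parcoord()
    parcoord(iris[, 1:4], col = as.integer(iris$Species))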

Some Data Mining Resources

I’ve pulled together some data mining resources for future reference. Let me know if you find a good one that I’m missing.

2011 Conferences
http://www.iienet.org/Annual2/Details.aspx?id=7270
http://www.amstat.org/meetings/jsm/2011/index.cfm
JMP conference http://www.jmp.com/about/events/summit2011/papers_overview.shtml

Websites
http://www.kdnuggets.com/
http://blogs.sas.com/jmp/
http://www.scientificcomputing.com/Data_Analysis/

Magazines
SAScom

Free MIT courses can be accessed online at http://ocw.mit.edu/courses/
Example courses that stood out to me just now:
Data Mining
Statistics and Visualization for Data Analysis and Inference
Street-Fighting Mathematics
Introduction to Modeling and Simulation
Computing and Data Analysis
Models, Data, and Inference for Socio-Technical Systems

Certificates
http://scpd.stanford.edu/

Computational Statistics

One of my favorite books is Statistical Computing with R by Maria L. Rizzo, in which she describes computational statistics as an area within statistics that uses computational, graphical, and numerical approaches to solve statistical problems. Computational statistics encompasses exploratory data analysis, Monte Carlo methods, and data partitioning.
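
As a toy illustration of the Monte Carlo side of this (my own example, not taken from the book), here is a short snippet that estimates a normal tail probability by simulation and compares it with the exact value:

    # Monte Carlo estimate of P(X > 2) for a standard normal, compared
    # with the exact answer from pnorm().
    set.seed(1)
    x <- rnorm(1e6)
    mc_estimate <- mean(x > 2)
    exact <- 1 - pnorm(2)
    c(monte_carlo = mc_estimate, exact = exact)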

In her book she says, “The increasing interest in massive and streaming data sets, and high dimensional data arising in applications of biology and engineering, for example, demand improved and new computational approaches for multivariate analysis and visualization.”

In my current company, my position gives me the opportunity to analyze a constant stream of manufacturing observations as they relate to several thousand engineering variables. This exciting opportunity will allow me to implement many of the same multivariate methods and design-of-experiments techniques that can also be applied to finance and quantitative trading.

Engineering Data Analysis

Hadley Wickham gave an excellent presentation earlier this month as a Google Tech Talk. I found the talk via http://www.r-statistics.com/2011/06/engineering-data-analysis-with-r-and-ggplot2-a-google-tech-talk-given-by-hadley-wickham/ He focused on domain-specific languages and the use of R as the programming language for data analysis.

The basic “transform” verbs he covered (a small example follows the list):
  subset
  mutate
  arrange
  summarize
  by operator (ddply)
  join
  match_df
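
Here is a small example of those verbs using plyr and the built-in mtcars data (my own illustration, not code from the talk):

    # Transform verbs with plyr on the built-in mtcars data
    library(plyr)

    small  <- subset(mtcars, cyl != 6)              # subset: keep only some rows
    small  <- mutate(small, kpl = 0.425 * mpg)      # mutate: add a derived column (km per liter)
    small  <- arrange(small, desc(kpl))             # arrange: sort rows
    by_cyl <- ddply(small, "cyl", summarize,        # "by" operator: split-apply-combine
                    mean_kpl = mean(kpl),
                    n        = length(kpl))

    # join: merge on a key column
    cyl_labels <- data.frame(cyl = c(4, 6, 8), size = c("small", "medium", "large"))
    joined <- join(by_cyl, cyl_labels, by = "cyl")
    joined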

He also gave an excellent example using a real data set, including the R code.

His conclusions:
A programming language gives you reproducibility, automation, and communication, but comes with a learning curve.
R gives you freedom, a community, connectivity, and building blocks, but the community can be prickly and it is slow (relative to other languages).
Thoughtful DSLs should make it easier to solve common data analysis problems.

Design of Experiments

We conduct experiments to improve process performance and product quality. The company’s default analysis method has always been to use an ANOVA to pick the best setting among a small set of options. Strictly speaking, this is a one-factor experiment; in practice, however, it is treated as a way of choosing the best from a set of recipes, where a recipe may actually change more than one factor at a time.

Since I came on board at the beginning of the year, there has been a big push toward the correct application of DOE, especially when changing more than one factor at a time. Fundamentally, this methodology involves fitting a regression model that describes the response(s) as a function of the factors of interest.
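
As a sketch of what that looks like in practice, here is a hypothetical two-factor, two-level experiment analyzed with a regression model in R; the factor names, effect sizes, and data are made up purely for illustration:

    # Hypothetical 2x2 factorial with 3 replicates: fit a regression model
    # describing the response as a function of the factors.
    doe <- expand.grid(temp = c(-1, 1), pressure = c(-1, 1), rep = 1:3)
    set.seed(42)
    doe$yield <- 50 + 3 * doe$temp - 2 * doe$pressure +
                 1.5 * doe$temp * doe$pressure + rnorm(nrow(doe), sd = 1)

    fit <- lm(yield ~ temp * pressure, data = doe)  # main effects + interaction
    summary(fit)                                    # coefficients are the factor effects
    anova(fit)                                      # the familiar ANOVA view of the same model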

There has been some resistance within the company, but we are excited to be moving forward and will soon begin analyzing some of our first experiments. Jiju Antony’s book Design of Experiments for Engineers and Scientists is a brief but excellent resource. Chapter 4 gives a great introduction to the methodology, starting with the barriers to successful experiment design, and continues with a brief but sufficiently comprehensive outline of practical experimental methodology from start to finish. Chapter 8, “Some useful and practical tips for making your industrial experiments successful,” is also a great chapter.

It is in section 8.1.9 that the book raises the question: how many experimental runs are required to identify significant effect(s), given the current process variation? This has been on my mind since we started, partly because we are limited by our process and by the way things are done at the company. These constraints encourage us to limit the total count of all samples, across all variables and levels in an experiment, to 25 or, if necessary, to multiples of 25.

Here at the company I have caused quite a debate about how to answer this question. I’ll present my solutions in detail later on. Unfortunately, I have not found an all-inclusive method to precisely determine the sample size, because there are so many ways to design an experiment. Certainly your confidence levels (alpha and beta) come into play, along with the historical variation of the natural process. Detectability also depends on how small a shift (delta) you need to resolve relative to the noise, as well as on the distribution of the response.
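
As a starting point for the simplest case, a two-level comparison against historical noise, R’s built-in power calculations give a feel for the run counts involved. The sigma and delta values below are placeholders, and a real DOE power calculation depends on the specific design:

    # Runs per group needed to detect a shift of size delta against
    # historical noise sigma, for a simple two-level comparison.
    sigma <- 2.0    # historical process standard deviation (assumed)
    delta <- 1.5    # smallest shift worth detecting (assumed)
    power.t.test(delta = delta, sd = sigma, sig.level = 0.05, power = 0.80)

    # For more than two levels of a single factor, power.anova.test gives
    # the analogous answer (between.var is the assumed variance of the
    # group means).
    power.anova.test(groups = 4, between.var = 1, within.var = sigma^2,
                     sig.level = 0.05, power = 0.80)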

Furthermore, when we conduct experiments it would be ideal to identify up front the critical responses that are expected to move, or at least the ones we would like to test for movement. In practice, however, the approach has typically been to run an experiment and then test ALL responses that we measure; the count of these responses is on the order of 1000+. To my understanding, the main objective is first to determine which group best meets the experimenter’s goals and objectives, and then to observe all other responses to ensure they do not deviate beyond the current process variation. If any responses shift unexpectedly, the changes and trade-offs can then be investigated and understood.
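
Here is a sketch of how that screening step could look, assuming hypothetical data frames control and experiment with one numeric column per response; the false-discovery-rate adjustment is my suggestion for keeping 1000+ simultaneous tests honest, not the company’s current practice:

    # Screen many responses for shifts between control and experiment,
    # then adjust for testing hundreds of responses at once.
    # 'control' and 'experiment' are assumed data frames with one numeric
    # column per response and one row per sample.
    p_values <- sapply(names(control), function(resp) {
      t.test(control[[resp]], experiment[[resp]])$p.value
    })

    # False-discovery-rate adjustment across the long list of tests
    adjusted <- p.adjust(p_values, method = "BH")
    flagged  <- names(adjusted)[adjusted < 0.05]  # responses that appear to have shifted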

Soon I’d also like to go into more detail about the challenges of DOE with multiple responses. I’ll also include the solutions I’ve found so far for weighting responses and automatically selecting which ones are significant and worth investigating further. I’d also like to compare this process with experimental analysis using ANOVA. Hopefully I can establish a statistically sound method that is easy enough to follow for any of the company’s employees who have time constraints or who currently lack the depth of academic understanding.