Finding hidden data patterns and correlations: The beer and diapers example

Once data patterns have been discovered, analysts must determine whether the correlations are incidental, accidental or in a cause-and-effect relationship.

This article originally appeared on the BeyeNETWORK.

Your intuition tells you that looking for and finding hidden patterns of data – correlations – is a good thing to do. In sales data, claims processing data, manufacturing data and human resource data, there are patterns of data that repeatedly occur. And with the mass of numbers that confront the corporation, these correlations of data are often hidden.

It then becomes the mission of the analyst to find these hidden patterns of data. For example, the analyst may find:

  • That Ivy League employees drink too much after the age of forty,

  • That manufacturing productivity picks up after the 20th of the month,

  • That auto insurance claims tend to be for greater amounts in January and February,

  • That government employees have more sick days in July than any other month, and

  • That when people go shopping on Friday nights, they buy beer and diapers together.

Once the data patterns have been discovered, the following question is raised: Are the correlations incidental, accidental, or in a cause-and-effect relationship?

If you consider a large enough number of variables, some pair of them will correlate by accident alone. For example, there was once a correlation between the annual rise and fall of the stock market and the league of the Super Bowl winner. In years when a team from the old NFL won, the stock market rose. In years when a team from the old AFL won, the market fell. This correlation held for almost twenty years. Of course, the winner of the Super Bowl has nothing to do with the economy and productivity of the nation and the world. This is a fine example of a purely accidental or coincidental correlation.
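The effect of sheer numbers can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the series are pure random noise, and the function names (`pearson`, `strongest_accidental_correlation`) and the sizes chosen are hypothetical, invented for illustration. It shows that the strongest correlation found among unrelated series tends to grow as more series are compared.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_accidental_correlation(n_series, n_points, seed=0):
    """Generate unrelated random series and return the largest |r| over all pairs."""
    rng = random.Random(seed)
    # Every series is independent noise, so any correlation is accidental.
    series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]
    return max(abs(pearson(a, b)) for a, b in combinations(series, 2))


# Twenty observations per series, roughly "twenty years" of annual data.
few = strongest_accidental_correlation(n_series=5, n_points=20)
many = strongest_accidental_correlation(n_series=200, n_points=20)

print(f"strongest |r| among 5 random series:   {few:.2f}")
print(f"strongest |r| among 200 random series: {many:.2f}")
```

With only twenty observations per series, it takes surprisingly few candidate variables before some pair looks strongly related, which is one way a Super Bowl style coincidence can arise.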

An incidental correlation is one where there may be a cause, but the cause is neither of the variables being studied. For example, consider the number of defective parts produced and the yield of production. These related numbers usually correlate nicely, but neither one causes the other. Instead, many other factors affect both the productivity and the quality of manufacturing.
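A minimal sketch of an incidental correlation follows. Everything in it is an assumption made for illustration: the hidden factor `line_condition`, the coefficients, and the noise levels are invented, not taken from real manufacturing data. A single hidden factor drives both defects and yield, so the two correlate strongly, yet the correlation all but disappears once the hidden factor is controlled for with a first-order partial correlation.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


rng = random.Random(42)
n = 2000

# Hypothetical hidden factor, e.g. how well the production line is maintained.
line_condition = [rng.gauss(0, 1) for _ in range(n)]

# Both measured variables depend on the hidden factor, not on each other.
defects = [10 - 3 * z + rng.gauss(0, 1) for z in line_condition]
yield_pct = [80 + 5 * z + rng.gauss(0, 1) for z in line_condition]

raw = pearson(defects, yield_pct)
controlled = partial_correlation(defects, yield_pct, line_condition)

print(f"raw correlation of defects and yield:    {raw:.2f}")
print(f"after controlling for the hidden factor: {controlled:.2f}")
```

In practice the hidden factor is rarely measured so conveniently, which is why an incidental correlation is easy to mistake for cause and effect.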

However, there are occasional cause-and-effect relationships. Consider the variables of blue-collar jobs and rate of pay. The rate of pay for a blue-collar job will generally be less than that of a white-collar profession. In this case, there is a definite cause-and-effect relationship in the correlation.

Now let’s get to the beer and diaper dilemma. Our industry has used this apocryphal story as an example of data mining and correlation analysis for years. Interestingly, the origins of the story are clouded. Many people think that there really is a correlation between the buying of beer and diapers together on a Friday night. Other people say that this is merely an example dreamed up by an executive to make a point, and the correlation never existed at all. Whatever the origins, and whatever the truth, the correlation between buying beer and diapers together is accepted by the information processing industry.
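For readers who want to see how such a pairing might be measured, here is a sketch using the standard market basket measures of support, confidence and lift. The eight baskets are invented toy data, not real sales figures, and the function names are hypothetical. A lift above 1 means the two items appear together more often than they would if purchases were independent.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


def lift(baskets, a, b):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    both = set(a) | set(b)
    return support(baskets, both) / (support(baskets, a) * support(baskets, b))


# Invented Friday-night baskets, for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
    {"diapers", "bread"},
]

pair_support = support(baskets, {"beer", "diapers"})
beer_to_diapers = confidence(baskets, {"beer"}, {"diapers"})
pair_lift = lift(baskets, {"beer"}, {"diapers"})

print(f"support(beer and diapers):   {pair_support:.3f}")
print(f"confidence(beer -> diapers): {beer_to_diapers:.3f}")
print(f"lift(beer, diapers):         {pair_lift:.3f}")
```

None of these measures says anything about why the items are bought together, so the question of accidental, incidental or causal still has to be answered separately.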

One of the intriguing aspects of the beer and diapers myth is that the response taken by the store is not at all clear. Most stores appear to say, “So what?” when it comes to the interpretation of the correlation between the buying of beer and diapers and what to do about it. One school of thought says that you should position beer and diapers closely together in a store. In doing so, you will maximize revenue. But another school of thought says that you should do something entirely different. This school of thought says that you should place the beer and diapers as far apart as possible. In doing so, you will maximize spontaneous spending. By forcing people to walk all over the store, you will maximize the chance that they will do some impulse buying. One school of thought maximizes immediate revenue, and the other school of thought maximizes impulse buying revenue.

Who knows who is right?

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.
