One of the core activities in the maturing data science process is the use of data mining and machine learning algorithms to develop predictive models that aim to forecast customer behavior and other future events. But even the best-designed models can go astray if you don't feed them the right data sets upfront.
A predictive model essentially relies on a set of predictor variables whose values are expected to influence future activities. Weather modeling is a common example. The historical values of a set of variables related to environmental factors are analyzed to see which combinations preceded particular types of weather events such as hurricanes, snowstorms or sunny days. Then the analytical models are run against data on current conditions to generate forecasts.
Predictive models are also used in many different business applications. Banks rely on models that include age, marital status, place of residence, credit history and other variables to assess the risks associated with mortgage applications by prospective customers.
Financial services firms, telecommunications companies and other businesses run models with data on things such as historical purchasing patterns and call center interactions to predict when customers might be about to close their accounts. Online retailers analyze previous purchases and current clickstream data to recommend products and predict the probability that customers will complete purchases so they can make promotional offers, if necessary. And those are just a few examples.
Predictive models get some supervision
Predictive models are often developed using a process called "supervised learning." A set of predetermined outcomes is chosen, variables that might contribute to predicting them are identified, and statistical analysis algorithms are applied to a training data set to determine which variables are the most relevant predictors and how they should be weighted. Collecting suitable data sets is a key step in that process, which looks for the most statistically relevant variable values that precede each of the selected outcomes. The end result is a set of rules that map weighted combinations of the predictor variables' values to the chosen outcomes.
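To make the idea of weighting predictor variables concrete, here is a minimal sketch that fits a logistic regression by gradient descent in plain Python. Logistic regression is only one of many supervised learning algorithms, and everything in the example is an assumption made for illustration: the two predictors (`low_humidity`, `late_sunrise`), their +1/-1 coding, the four toy observations and the function names are all hypothetical.

```python
import math

# Hypothetical toy observations; each predictor is coded +1 or -1.
# Features: (low_humidity, late_sunrise). Label: 1 = sunny next day, 0 = not.
ROWS = [
    ((1.0, 1.0), 1),
    ((1.0, -1.0), 1),
    ((-1.0, 1.0), 0),
    ((-1.0, -1.0), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, epochs=200, learning_rate=0.5):
    """Learn one weight per predictor by batch gradient descent on the log loss."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in rows:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label  # predicted probability minus known outcome
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def predict_probability(weights, bias, features):
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


weights, bias = fit_logistic(ROWS)
print("weight for low_humidity:", round(weights[0], 2))
print("weight for late_sunrise:", round(weights[1], 2))
```

In this toy data, humidity lines up perfectly with the outcome while sunrise time carries no information, so the fitted weight for `low_humidity` grows large and the weight for `late_sunrise` stays at zero. Real data is rarely this clean, which is why the choice of data set matters so much.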
Using the earlier weather forecasting example, a data scientist or other analyst might choose five different weather scenarios: snowstorm, thunderstorm, sunshine, fog and wind. Next, the analyst selects a collection of variables such as temperature, humidity, cloud cover, wind speed, sunrise time, location of high-pressure and low-pressure systems, and direction of the jet stream. The values for those variables are then collected and analyzed. The completed analysis will provide predictive guidelines like this: "Ninety-four percent of the time, when today's temperature is above 65 degrees, the humidity is below 20%, there is 10% cloud cover and there's a high-pressure system moving through the area, tomorrow will be a sunny day with a chance of clouds developing later in the day."
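A guideline of that form is, at its simplest, a conditional frequency: of all the days that matched the conditions, what share was followed by the outcome? The sketch below computes that share from a handful of made-up records. The field names, the six records and the simplified three-part condition are hypothetical, and real forecasting models are far more involved.

```python
# Hypothetical historical records: today's conditions plus tomorrow's outcome.
HISTORY = [
    {"temp_f": 70, "humidity_pct": 15, "high_pressure": True, "next_day": "sunshine"},
    {"temp_f": 68, "humidity_pct": 10, "high_pressure": True, "next_day": "sunshine"},
    {"temp_f": 72, "humidity_pct": 18, "high_pressure": True, "next_day": "sunshine"},
    {"temp_f": 66, "humidity_pct": 12, "high_pressure": True, "next_day": "fog"},
    {"temp_f": 50, "humidity_pct": 80, "high_pressure": False, "next_day": "thunderstorm"},
    {"temp_f": 30, "humidity_pct": 60, "high_pressure": False, "next_day": "snowstorm"},
]


def matches_rule(record):
    """Simplified version of the conditions in the guideline (assumed thresholds)."""
    return (
        record["temp_f"] > 65
        and record["humidity_pct"] < 20
        and record["high_pressure"]
    )


def rule_confidence(records, condition, outcome):
    """Share of records meeting the condition that were followed by the outcome."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return None  # the conditions never occurred, so nothing can be estimated
    hits = sum(1 for r in matching if r["next_day"] == outcome)
    return hits / len(matching)


confidence = rule_confidence(HISTORY, matches_rule, "sunshine")
print(f"Sunshine followed the matching conditions {confidence:.0%} of the time")
```

With only four matching days, the resulting 75% figure is not trustworthy. A data set that holds few examples of the conditions of interest cannot support a confident rule.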
Data bias gives models errant views
Analytics teams face some big challenges in developing accurate predictive models from any particular data set. A basic one is that data sets can have an inherent bias. As a result, a model might fit one data set very well but not generalize to others, a problem commonly described as overfitting.
That's why analysts typically divide the data they use into two groups: a training data set used to develop a model that can produce the desired output, and a validation data set used to check for biases and verify that the model works properly, so the model can be tweaked as needed to get valid results. Some data scientists even go with three data sets, using separate ones to tweak the model and to verify its accuracy.
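A three-way division of that kind can be sketched in a few lines of plain Python. The 60/20/20 proportions, the fixed seed and the function name `split_three_ways` are assumptions chosen for illustration, not a recommendation.

```python
import random


def split_three_ways(records, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle once, then cut into training, validation and test sets."""
    if not 0 < train_frac + validation_frac < 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(records)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


records = list(range(100))  # stand-in for 100 historical observations
train, validation, test = split_three_ways(records)
print(len(train), len(validation), len(test))
```

Shuffling before cutting matters: if the records are sorted by date or by outcome, slicing them in order would hand each stage a different, unrepresentative slice of the data.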
To avoid faulty predictions, some real care must be taken when choosing the data mining data sets for predictive modeling efforts. First, make sure your data set contains enough data to fairly represent the real occurrences that you're trying to model and analyze. Also, make sure it's large and diverse enough to cover all the scenarios for the outcomes you're looking to model. Finally, divide it for the different stages of the model development process in a way that doesn't introduce or reinforce potential biases.
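Two of those checks lend themselves to a short sketch: confirming that every outcome you want to model actually appears in the data, and dividing the data so each portion keeps the same mix of outcomes (often called a stratified split). The record layout, the 25% validation share and the function names below are hypothetical.

```python
import random
from collections import Counter, defaultdict


def missing_outcomes(records, expected_outcomes, outcome_key="outcome"):
    """Return the expected outcomes that never occur in the data set."""
    seen = {r[outcome_key] for r in records}
    return sorted(set(expected_outcomes) - seen)


def stratified_split(records, validation_frac=0.25, outcome_key="outcome", seed=0):
    """Split each outcome group separately so both parts keep the same mix."""
    rng = random.Random(seed)
    by_outcome = defaultdict(list)
    for r in records:
        by_outcome[r[outcome_key]].append(r)
    train, validation = [], []
    for outcome in sorted(by_outcome):  # sorted for a repeatable order
        group = by_outcome[outcome]
        rng.shuffle(group)
        n_validation = round(len(group) * validation_frac)
        validation.extend(group[:n_validation])
        train.extend(group[n_validation:])
    return train, validation


# Hypothetical data: 80 sunny days and 20 snowstorms, nothing else.
records = (
    [{"id": i, "outcome": "sunshine"} for i in range(80)]
    + [{"id": 80 + i, "outcome": "snowstorm"} for i in range(20)]
)
scenarios = ["snowstorm", "thunderstorm", "sunshine", "fog", "wind"]

gaps = missing_outcomes(records, scenarios)
train, validation = stratified_split(records)
print("outcomes with no examples:", gaps)
print("validation mix:", Counter(r["outcome"] for r in validation))
```

Here the coverage check flags three scenarios with no examples at all, which no split can repair; the only remedy is to collect more data. The stratified split keeps the 80/20 ratio of sunshine to snowstorms in both portions, so the rarer outcome is not left out of either one by chance.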
Starting with the right data sets will help improve the results of your data mining and predictive analytics projects. Using the wrong ones -- well, it's easy to predict how that will turn out.