How to evaluate the viability of a data mining project

In a book excerpt, author David Nettleton details questions to ask and factors to consider in weighing the potential benefits of data mining projects.

This is an excerpt from Chapter 2, "Business Objectives," from the book Commercial Data Mining: Processing, Analysis and Modeling for Predictive Analytics Projects, by David Nettleton. Nettleton is a consultant and academic researcher with more than 25 years of IT experience, primarily in databases and data analysis. In this chapter, he explains how to set up a data analysis project for success by identifying specific business objectives in the planning stage. He goes on to discuss other considerations related to data availability, project scope and data quality. Using simple mathematical formulas, Nettleton also describes how to estimate the effectiveness of a data mining project prior to execution.

A commercial data analysis project that lives up to its expectations will probably do so because sufficient time was dedicated at the outset to defining the project's business objectives. What is meant by business objectives? The following are some examples:

  • Reduce the loss of existing customers by three percent.
  • Augment the contract signings of new customers by two percent.
  • Augment the sales from cross-selling products to existing customers by five percent.
  • Predict the television audience share with a probability of 70%.
  • Predict, with a precision of 75%, which clients are most likely to contract a new product.
  • Identify new categories of clients and products.
  • Create a new customer segmentation model.

The first three examples define a specific percentage of improvement as part of the objective.

In the fourth and fifth examples, an absolute value is specified for the desired precision of the data model. In the final two examples, the desired improvement is not quantified; instead, the objective is expressed in qualitative terms.

Criteria for choosing a viable project

This section lists the main issues and poses some key questions relevant to evaluating the viability of a potential data mining project. The checklists of general and specific considerations provided here form the basis for the rest of the chapter, which specifies benefit and cost criteria in more detail and applies these definitions to two case studies.

Evaluation of potential commercial data analysis projects -- General considerations

Copyright info

This excerpt is from the book Commercial Data Mining: Processing, Analysis and Modeling for Predictive Analytics Projects, by David Nettleton. Published by Morgan Kaufmann Publishers, Burlington, Massachusetts. ISBN 9780124166585. Copyright 2014, Elsevier BV.

The following is a list of questions to ask when considering a data analysis project:

  • Is data available that is consistent and correlated with the business objectives?
  • What is the capacity for improvement with respect to the current methods? (The greater the capacity for improvement, the greater the economic benefit.)
  • Is there an operational business need for the project results?
  • Can the problem be solved by other techniques or methods? (If the answer is no, the return on the project will be greater.)
  • Does the project have a well-defined scope? (If this is the first instance of a project of this type, reducing the scale of the project is recommended.)

Evaluation of viability in terms of available data -- Specific considerations

The following list provides specific considerations for evaluating the viability of a data mining project in terms of the available data:

  • Does the necessary data for the business objectives exist, and does the business have access to it?
  • If part or all of the data does not exist, can processes be defined to capture or obtain it?
  • What is the coverage of the data with respect to the business objectives?
  • Is a sufficient volume of data available over the required period of time, for all clients, product types, sales channels and so on? (The data should cover all the business factors to be analyzed and modeled. The historical data should cover the current business cycle.)
  • Is it necessary to evaluate the quality of the available data in terms of reliability? (The reliability depends on the percentage of erroneous data and incomplete or missing data. The ranges of values must be sufficiently wide to cover all cases of interest.)
  • Are people available who are familiar with the relevant data and the operational processes that generate the data?

Factors that influence project benefits

Business Objective

Assigning a Value for Percent Improvement
The percentage improvement should always be considered with regard to the current precision of an existing index as a baseline. Also, the new precision objective should not get lost in the error bars of the current precision. That is, if the current precision has an error margin of +/-3% in its measurement or calculation, this should be taken into account.
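
The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `improvement_is_measurable` and the sample figures are hypothetical, and the rule used (the target must exceed the baseline by more than the baseline's error margin) is one simple reading of the advice, not a formula given in the book.

```python
def improvement_is_measurable(current_precision, target_precision, error_margin):
    """Return True if the target exceeds the current precision by more than
    the error margin of the current measurement.

    All three arguments are fractions (0.03 means 3 percentage points).
    Assumption: a gain no larger than the error margin cannot be
    distinguished from measurement noise.
    """
    return (target_precision - current_precision) > error_margin


# Hypothetical baseline: 70% precision, measured to within +/-3%.
baseline = 0.70
margin = 0.03

for target in (0.72, 0.75):
    verdict = improvement_is_measurable(baseline, target, margin)
    print(f"target {target:.2f}: measurable improvement = {verdict}")
```

Under this reading, a target of 72% would sit inside the error bars of a 70% baseline, while a target of 75% would not.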

Several factors influence the benefits of a project. A qualitative assessment of current functionality is required first: what is the current grade of satisfaction with how the task is being done? A value between 0 and 1 is assigned, where 1 is the highest grade of satisfaction and 0 is the lowest. The lower the current grade of satisfaction, the greater the potential improvement and, consequently, the benefit.

The potential quality of the result (the evaluation of future functionality) can be estimated from three aspects of the data, namely coverage, reliability and correlation:

  • The coverage or completeness of the data, assigned a value between 0 and 1, where 1 indicates total coverage.
  • The quality or reliability of the data, assigned a value between 0 and 1, where 1 indicates the highest quality. (Both the coverage and the reliability are normally measured variable by variable, giving a total for the whole data set. Good coverage and reliability for the data help to make the analysis a success, thus giving a greater benefit.)
  • The correlation of the data with the business objective, that is, its grade of dependence on the objective, can be measured statistically. A correlation is typically measured as a value from –1 (total negative correlation) through 0 (no correlation) to 1 (total positive correlation). For example, if the business objective is that clients buy more products, the correlation would be calculated for each customer variable (age, time as a customer, zip code of postal address, etc.) with the customer's sales volume.
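
As a rough illustration of how these three values might be measured for a single variable, the following Python sketch uses only the standard library. The definitions are assumptions made for the example, not the book's: coverage is taken as the fraction of records with a value present, reliability as the fraction of records whose value is present and passes a caller-supplied validity check, and correlation as the Pearson coefficient. The function names and the sample data are hypothetical.

```python
from math import sqrt


def coverage(values):
    """Fraction of records with a value present (None marks a missing value)."""
    return sum(v is not None for v in values) / len(values)


def reliability(values, is_valid):
    """Fraction of all records whose value is present and passes is_valid.

    Assumption: both missing and erroneous values count against reliability.
    """
    return sum(v is not None and is_valid(v) for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation over the records where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0  # assumption: too few points is treated as no correlation
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant variable is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


# Made-up sample: customer age (one missing, one implausible) and sales volume.
age = [25, 40, None, 55, 130]
sales = [100, 200, 150, 300, 250]

print(f"coverage:    {coverage(age):.2f}")
print(f"reliability: {reliability(age, lambda a: 0 <= a <= 110):.2f}")
print(f"correlation: {pearson(age, sales):.2f}")
```

Repeating the calculation for each variable and combining the per-variable results (for example, by averaging) would give totals for the whole data set; the combination rule is left open here because the text does not specify one.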

Once individual values for coverage, reliability and correlation have been obtained, the future functionality can be estimated using the formula:

Future functionality = (correlation + reliability + coverage)/3

An estimation of the possible improvement is then determined by calculating the difference between the current and the future functionality, thus:

Estimated improvement = Future functionality - Current functionality
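
The two formulas can be tried out with a short Python sketch. The function names and the input values are hypothetical and chosen only to show the arithmetic. The correlation is passed in as given; the text does not say how a negative correlation should be handled, so that case is not addressed here.

```python
def future_functionality(correlation, reliability, coverage):
    """Average of the three data-quality estimates, as in the formula above."""
    return (correlation + reliability + coverage) / 3


def estimated_improvement(future, current):
    """Difference between future and current functionality."""
    return future - current


# Made-up inputs for illustration only.
current = 0.5  # current grade of satisfaction (0 = lowest, 1 = highest)
future = future_functionality(correlation=0.6, reliability=0.9, coverage=0.9)

print(f"future functionality:  {future:.2f}")
print(f"estimated improvement: {estimated_improvement(future, current):.2f}")
```

With these inputs, the future functionality comes to about 0.8 and the estimated improvement to about 0.3.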

A fourth aspect, volatility, concerns the amount of time the results of the analysis or data modeling will remain valid.

The volatility of the environment of the business objective can be defined as a value between 0 and 1, where 0 is minimum volatility and 1 is maximum volatility. High volatility can cause models and conclusions to go out of date quickly with respect to the data; even the business objective can lose relevance. Volatility depends on whether the results are applicable over the long, medium or short term with respect to the business cycle.

Note that this a priori evaluation gives an idea of the viability of a data mining project. However, the quality and precision of the end result will also depend on how well the project is executed: analysis, modeling, implementation, deployment and so on. The next section, which deals with the estimation of the cost of the project, includes a factor (expertise) that evaluates the availability of the people and skills necessary to guarantee the a posteriori success of the project.
