David Nettleton, the author of Commercial Data Mining: Processing, Analysis and Modeling for Predictive Analytics Projects, is a consultant and academic researcher with extensive experience in data analysis processes. In this interview, Nettleton discusses the challenges that organizations face when planning and executing data mining projects and other advanced analytics initiatives. He also offers advice on developing effective data mining applications and tips for keeping data analysis efforts on track, which often involves using a variety of methods to meet stated business objectives.
What do you find are the most challenging aspects of implementing a data analysis project?
David Nettleton: It depends. Some things we expect to be easy become more difficult as the project progresses, and others we expect to be difficult turn out to be easy. The first step in a project is to define one or more business objectives. This may be achieved quickly, or things may get bogged down at the outset.
Then, a brainstorming session is necessary to choose the most viable objectives. That means evaluating the viability of each one, which is linked to the availability of the data. Obtaining, filtering and preparing the right data is often a crucial step. Project members may find it more interesting to go directly to data analysis than to do the routine work of preparing and validating the data.
However, data preparation is a key aspect that determines the success or failure of the later analysis and mining phases. We may find that the required data variables don't exist, and we have to obtain them. We may find that some key variables are available but that the data is erroneous or in an incorrect format. Another problematic step is deployment. We must decide how we will utilize the results in our business processes.
How can the people involved in these projects contribute to their success? What skill sets are helpful?
Nettleton: A mix of business and IT skills is generally desirable. The availability of someone who has a good knowledge of the data is also a plus. And previous experience on this type of project is clearly an advantage. Initially, a marketing or [business] manager may propose one or more business objectives. Next, the IT manager will make an initial list of the data required in order to fulfill each business objective, and will then scrutinize the company database to see if the stated data is currently available. Once we have the right data, we will then need the collaboration of an analyst who is adept at using the chosen data analysis and mining tool.
Is there such a thing as analyzing too much data?
Nettleton: This depends on the business objectives. A study of outliers, such as in fraud detection, may require the exhaustive processing of all available data in order to catch the exceptions. If we want to perform customer segmentation, do we need all the customers in order to do this? The answer is no, as long as we are able to extract a representative sample from the complete data set.
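One common way to extract a representative sample is stratified sampling, where the same fraction is drawn from each group so that the sample keeps the proportions of the full data set. The following is a minimal sketch under stated assumptions, not a method described in the interview: the `segment` field, the 10% fraction and the toy customer list are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    sample = []
    for group in strata.values():
        # Keep at least one record per group so small groups are not lost.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical customer base: 800 retail and 200 business customers.
customers = (
    [{"id": i, "segment": "retail"} for i in range(800)]
    + [{"id": 800 + i, "segment": "business"} for i in range(200)]
)

sample = stratified_sample(customers, key="segment", fraction=0.1)
counts = Counter(rec["segment"] for rec in sample)
print(dict(counts))  # {'retail': 80, 'business': 20}
```

The sample holds 100 of the 1,000 customers and preserves the 80/20 split between the two groups. Whether a given stratification variable is the right one depends on the business objective.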
Big data also means specialist software, such as Hadoop, and specialist hardware, such as clusters of servers. Also, data volume can be measured in width (number of descriptive variables) and/or in length (number of records). We could have a billion records with four variables, or we could have a million records with a hundred variables.
For each variable, we have to question why we need it, and for the volume of records, we have to ask what their coverage is. Why process 10 years of historical data when the current business cycle can be represented in a two-year period? If we are a small or medium-sized company with limited processing power, we must consider the cost of processing the data versus the expected benefit obtained from mining it.
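As an illustration of limiting the history that gets processed, the sketch below keeps only the records that fall inside a recent window. It is an assumption-laden example: the `order_date` field, the reference date and the two-year default are hypothetical, and the right window length depends on the business cycle in question.

```python
from datetime import date


def within_window(records, date_field, end, years=2):
    """Keep records dated in the `years` leading up to and including `end`."""
    try:
        start = end.replace(year=end.year - years)
    except ValueError:
        # `end` is 29 February and the start year is not a leap year.
        start = end.replace(year=end.year - years, day=28)
    return [r for r in records if start < r[date_field] <= end]


# Hypothetical history: one order per year from 2015 to 2024.
orders = [{"order_date": date(year, 1, 15)} for year in range(2015, 2025)]

recent = within_window(orders, "order_date", end=date(2024, 6, 30))
print(len(orders), "records in total,", len(recent), "in the window")
```

Only the 2023 and 2024 orders survive the filter, so the later analysis steps handle a fifth of the original volume.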
What are some of the most common mistakes people make on data mining projects, and how can they be avoided?
Nettleton: There are three general types of analysis errors: bias in the data, errors in data processing and wrong interpretation.
The first type of error can be related to incorrect sampling or skewed data. For example, suppose we want to study the effect of anti-smoking health advertising on women between 18 and 35 years of age, but all the records in our data set correspond to ex-smokers. This can be mitigated by checking the data on the fly for correct distributions in terms of the key variable categories of interest.
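A distribution check of this kind can be sketched in a few lines: compute the share of each category in the data and compare it with the share we expect in the target population. Everything specific here is an assumption made for illustration, including the `smoking_status` field, the expected shares and the 10-point tolerance.

```python
from collections import Counter


def category_shares(records, key):
    """Return the fraction of records that fall into each category of `key`."""
    counts = Counter(rec[key] for rec in records)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def flag_skew(observed, expected, tolerance=0.10):
    """Return categories whose observed share is off by more than `tolerance`."""
    flagged = {}
    for cat, exp in expected.items():
        obs = observed.get(cat, 0.0)  # a missing category counts as a 0% share
        if abs(obs - exp) > tolerance:
            flagged[cat] = (obs, exp)
    return flagged


# Hypothetical data set in which every record is an ex-smoker.
records = [{"smoking_status": "ex-smoker"} for _ in range(50)]

# Hypothetical shares expected in the population under study.
expected = {"smoker": 0.3, "ex-smoker": 0.2, "never smoked": 0.5}

observed = category_shares(records, "smoking_status")
skewed = flag_skew(observed, expected)
for cat, (obs, exp) in skewed.items():
    print(f"{cat}: observed {obs:.0%}, expected {exp:.0%}")
```

All three categories are flagged here, which signals that the sample cannot support conclusions about smokers or people who never smoked.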
The second type of error can be due to selecting the wrong data, or errors in formatting with invalid date values, flags and so on. This is mitigated by spending more time and effort on the preprocessing phase, and the availability of a person skilled in data extraction and migration.
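Part of that preprocessing effort can be automated with simple validation rules. The sketch below checks one date field and one flag field per record. The field names (`signup_date`, `active`), the date format and the set of allowed flag values are hypothetical and would need to match the actual data.

```python
from datetime import datetime

VALID_FLAGS = {"Y", "N"}  # assumed set of legal flag values


def validate_record(rec, date_field="signup_date", flag_field="active",
                    date_format="%Y-%m-%d"):
    """Return a list of problems found in one record (empty if it is clean)."""
    problems = []
    try:
        # strptime rejects impossible dates such as 30 February.
        datetime.strptime(rec.get(date_field, ""), date_format)
    except (ValueError, TypeError):
        problems.append(f"invalid {date_field}: {rec.get(date_field)!r}")
    if rec.get(flag_field) not in VALID_FLAGS:
        problems.append(f"invalid {flag_field}: {rec.get(flag_field)!r}")
    return problems


# Hypothetical extract with one bad date and one bad flag.
rows = [
    {"signup_date": "2023-01-15", "active": "Y"},
    {"signup_date": "2023-02-30", "active": "Y"},
    {"signup_date": "2023-03-01", "active": "maybe"},
]

report = {}
for i, row in enumerate(rows):
    problems = validate_record(row)
    if problems:
        report[i] = problems

for i, problems in report.items():
    print(f"row {i}: {'; '.join(problems)}")
```

Rows 1 and 2 are reported and row 0 passes. Rules like these catch format errors only; they do not tell us whether the wrong data was selected in the first place.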
The third type of error, misinterpretation, may be due to lack of experience in data analysis or making an excessive generalization. Another associated problem is insufficient coverage -- for example, if data refers to just one geographic region instead of the whole country.
Other data analysis problems include:
- The lack of the right data for the task. This problem may be associated with having chosen an unviable business objective at the outset.
- Analysts relying on just one technique for data analysis, which may be their favorite technique or the one they know best. It's worthwhile to invest the time and effort to learn how to use a selection of distinct methods.
- Using an output variable -- that is, the future result -- as an input variable and consequently getting a fantastic but meaningless predictive precision, because the model has effectively been given the answer.
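The last problem in the list above can sometimes be spotted before any model is built. The sketch below is a crude heuristic, not a standard test and not something taken from the interview: it flags any input whose values each map to exactly one value of the output variable. The column names, the toy churn data and the `max_distinct_ratio` threshold (used to skip identifier-like columns) are all hypothetical.

```python
from collections import defaultdict


def suspicious_inputs(records, target, max_distinct_ratio=0.5):
    """Flag inputs whose values each correspond to a single target value."""
    flagged = []
    columns = [c for c in records[0] if c != target]
    for col in columns:
        targets_by_value = defaultdict(set)
        for rec in records:
            targets_by_value[rec[col]].add(rec[target])
        deterministic = all(len(t) == 1 for t in targets_by_value.values())
        # Columns that are nearly unique per record (such as IDs) trivially
        # determine the target, so they are skipped.
        few_values = len(targets_by_value) / len(records) <= max_distinct_ratio
        if deterministic and few_values:
            flagged.append(col)
    return flagged


# Hypothetical churn data: `closure_reason` is only known after a customer
# has already left, so it is really part of the outcome.
records = [
    {"tenure_months": 3, "plan": "basic", "closure_reason": "price", "churned": 1},
    {"tenure_months": 24, "plan": "premium", "closure_reason": "none", "churned": 0},
    {"tenure_months": 3, "plan": "basic", "closure_reason": "none", "churned": 0},
    {"tenure_months": 12, "plan": "basic", "closure_reason": "service", "churned": 1},
    {"tenure_months": 24, "plan": "premium", "closure_reason": "price", "churned": 1},
    {"tenure_months": 12, "plan": "premium", "closure_reason": "none", "churned": 0},
]

print(suspicious_inputs(records, target="churned"))  # ['closure_reason']
```

A flagged column is only a candidate for review. Someone who knows the data still has to decide whether the variable would really be available at prediction time.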