Table of contents:
An introduction to data mining
Simple data mining examples and datasets
Fielded applications of data mining and machine learning
The difference between machine learning and statistics in data mining
Information and examples on data mining and ethics
Data acquisition and integration techniques
What is a data rollup?
Calculating mode in data mining projects
Using data merging and concatenation techniques to integrate data
This chapter first provides a brief review of data sources and types of variables from the point of view of data mining. Then it presents the most common procedures of data rollup and aggregation, sampling, and partitioning.
2.2 Sources of Data
In most organizations today, data is stored in relational databases. The quality and utility of the data, as well as the amount of effort needed to transform the data to a form suitable for data mining, depends on the types of the applications the databases serve. These relational databases serve as data repositories for the following applications.
2.2.1 Operational Systems
Operational systems process the transactions that make an organization work. The data from these systems is, by nature, transient and keeps accumulating in the repository. A typical example of these systems is any banking transaction processing system that keeps records of opened and closed accounts, deposits, withdrawals, balances, and all other values related to the money moving among accounts, clients, and the outside world. Data extracted from such operational systems is the most raw form of data, in the sense that it has not been transformed, cleansed, or changed. It may contain errors as a result of data entry procedures or applications and usually has many missing values. It is also usually scattered over several tables and files. However, it is the most honest representation of the status of any business.
2.2.2 Data Warehouses and Data Marts
Data warehouses and data marts were conceived as a means to facilitate the compilation of regular reports on the status of the business by continuously collecting, cleaning, and summarizing the core data of the organization. Data warehouses provide a clean and organized source of data for data mining. In most cases, however, data warehouses were not created to prepare data for data modelers; they were rather created with a certain set of reporting functions in mind. Therefore, data residing in them might have been augmented or processed in a special way to facilitate those functions. Ideally, a specialized data mart should be created to house the data needed for data mining modeling and scoring processes.
Online analytical processing (OLAP) and similar software are often given the name business intelligence tools. These applications reside in the data warehouse, or have their own data warehouse, and provide a graphical interface to navigate, explore, and "slice and dice" the data. The data structures that OLAP applications operate on are called cubes. They also provide comprehensive reporting capabilities. OLAP systems could be a source of data for data mining because of the interactive exploration capabilities that they offer the user. Therefore, the user would find the interesting data elements related to the problem through OLAP applications and then apply data mining modeling for prediction.
Alternatively, data mining can offer the identification of the significant variables that govern the behavior of some business measure (such as profit), and then OLAP can use these variables (as dimensions) to navigate and get qualitative insight into existing relationships. Data extracted from OLAP cubes may not be granular enough for data mining. This is because continuous variables are usually binned before they can be used as dimensions in OLAP cubes. This binning process results in the loss of some information, which may have a significant impact on the performance of data mining algorithms.
Surveys are perhaps the most expensive source of data because they require direct interaction with customers. Surveys collect data through different communication channels with customers, such as mail, email, interviews, and forms on websites. There are many anecdotes about the accuracy and validity of the data collected from the different forms of surveys. However, they all share the following two common features:
- The number of customers who participate in the survey is usually limited because of the cost and the number of customers willing to participate.
- The questions asked in the survey can be designed to directly address the objective of the planned model. For example, if the objective is to market new products, the survey would ask customers about their preferences in these products, whether they would buy them, and what price would they pay for them.
These two points highlight the fact that, if well designed and executed, surveys are indeed the most accurate representation of possible customer behavior. However, they usually generate a limited amount of data because of the cost involved.
2.2.5 Household and Demographic Databases
In most countries, databases are commercially available that contain detailed information on consumers of different products and services. The most common type is demographic databases based on a national census, where the general demographic profile of each residential area is surveyed and summarized. Data obtained from such database providers is usually clean and information rich. Their only limitation is that data is not provided on the individual customer or record level but rather is averaged over a group of customers, for example, on the level of a postal (ZIP) code. Such limitations are usually set by privacy laws aimed at protecting individuals from abuse of such data.
The use of averaged data in models could lead to diluting the model's ability to accurately define a target group. For example, extensive use of census-like variables in a customer segmentation model would eventually lead to a model that clusters the population on the basis of the used census demographics and not in relation to the originally envisaged rate of usage or buying habits of the planned products or services.
It is not uncommon that analysts collect data from more than one source to form the initial mining view and for the scoring of mining models.
2.3 Variable Types
Designers of applications that use databases and different file systems attempt to optimize their applications in terms of the space required to keep the data and the speed of processing and accessing the data. Because of these considerations, the data extracted from databases is often not in optimal form from the point of view of data mining algorithms. To appreciate this issue, we provide the following discussion of the types of variables that most data mining algorithms deal with.
2.3.1 Nominal Variables
Nominal, or categorical, variables describe values that lack the properties of order, scale, or distance between them. For example, the variables representing the type of a housing unit can take the categories House, Apartment, or Shared Accommodation. One cannot enforce any meaning of order or scale on these values. Other examples include Gender (Male, Female), Account Type (Savings, Checking), and type of Credit Card (VISA, MasterCard, American Express, Diners Club, EuroCard, Discover, etc.).
From the point of view of data mining algorithms, it is important to retain the lack of order or scale in categorical variables. Therefore, it is not desirable that a category be represented in the data by a series of integers. For example, if the type of a house variable is represented by the integers 1 to 4 (1 = Detached, 2 = Semidetached, 3 = Townhome, 4 = Bungalow), a numeric algorithm may inadvertently add the numbers 1 and 2, resulting implicitly in the erroneous and meaningless statement of "Detached + Semidetached = Townhome"! Other erroneous, and equally meaningless, implications that "Bungalow > Detached" or "Bungalow − Semidetached = Townhome − Detached." The most convenient method of storing categorical variables in software applications is to use strings. This should force the application to interpret them as nominal variables.
2.3.1 Ordinal Variables
Ordinal, or rank or ordered scalar, variables are categorical variables with the notion of order added to them. For example, we may define the risk levels of defaulting on a credit card payment into three levels (Low, Medium, High). We can assert the order relationships High ≥ Medium ≥ Low. However, we cannot establish the notion of scale. In other words, we cannot accurately say that the difference between High and Medium is the same as the difference between Medium and Low levels of risk.
Based on the definition of ordinal variables, we can realize the problem that would arise when such variables are represented by a series of integers. For example, in the case of the risk level variable, representing these levels with numbers from 1 to 3 such that (Low = 1, Medium = 2, High = 3) would result in the imposition of an invalid notion of distance between the different values. In addition, this definition would impose the definition of scale on the values by implying that Medium risk is double the risk of Low, and High risk is three times the risk of Low.
Some ordinal variables come with the scale and distance notions added to them. These are best represented by a series of positive integers. They usually measure the frequency of occurrence of an event. Examples of such ordinal measures are number of local telephone calls within a month, number of people using a credit card in a week, and number of cars purchased by a prospective customer in her or his lifetime.
A typical problem, especially in data warehouses, exists in the representation of ordinal measures. Some ordinal measures are often subjected to binning to reduce the values we need to store and deal with. For example, a data warehouse may bin the number of times a customer uses a credit card per month to the representation 0 – 5 → 1, 6 → 10 → 2, 11 – 20 → 3, more than 20 → 4. Although this leads to a more compact representation of the variables, it may be detrimental to data mining algorithms for two reasons: (1) it reduces the granularity level of the data, which may result in a reduction in the predictive model accuracy, and (2) it distorts the ordinal nature of the original quantity being measured.
2.3.3 Real Measures
Real measures, or continuous variables, are the easiest to use and interpret. Continuous variables have all the desirable properties of variables: order, scale, and distance. They also have the meanings of zero and negative values defined. There could be some constraints imposed on the definition of continuous variables. For example, the age of a person cannot be negative and the monthly bill of a telephone line cannot be less than the subscription fees. Real measures are represented by real numbers, with any reasonably required precision.
The use of ratios in constrained continuous variables is sometimes troublesome. For example, if we allow the balance of a customer to be negative or positive, then the ratio between $ − 10,000.00 and $ − 5,000.00 is the same as that between $ + 10,000.00 and $ + 5,000.00. Therefore, some analysts like to distinguish between the so-called interval and ratio variables. We do not make that distinction here because in most cases the context of the implementation is clear. For example, if we wished to use the ratio of balances, we would restrict the balances to positive values only; if negative values occurred, we would devise another measure to signify that fact.
With the three types of variable from the mining point of view, the first task the analyst should consider, when acquiring the data, is to decide on the type of data to be used for each variable depending on its meaning. Of special interest are variables that represent dates and times. With the exception of time series analysis, dates and times are not useful in their raw form. One of the most effective methods of dealing with date and time values is to convert them to a period measure, that is, to calculate the difference between the values and a fixed reference value. For example, instead of dealing with the date of opening an account, we deal with total tenure as the difference between today's date and the date of opening the account. In fact, we use this method every day by referring to the age of a person instead of her or his birth date. In this way, we convert dates and times to real measures, with some constraint if necessary, as in the case of a person's age. (Negative age is not well defined!)
More on data mining:
This was first published in May 2009