This chapter from Data Mining: Know it All, learn learn the difference between merging and concatenation and see...
By submitting your email address, you agree to receive emails regarding relevant topic offers from TechTarget and its partners. You can withdraw your consent at any time. Contact TechTarget at 275 Grove Street, Newton, MA.
an example of merging and concatenation during the data mining process.
2.7 Data Integration
The data necessary to compile the mining view usually comes from many different tables. The rollup and summarization operations described in the last two sections can be performed on the data coming from each of these data sources independently. Finally, we would be required to assemble all these segments in one mining view. The most used assembly operations are merging and concatenation . Merging is used to collect data for the same key variable (e.g., customer ID) from different sources. Concatenation is used to assemble different portions of the same data fields for different segments of the key variable. It is most useful when preparing the scoring view with a very large number of observations (many millions). In this case, it is more efficient to partition the data into smaller segments, prepare each segment, and finally concatenate them together.
Table of contents:
An introduction to data mining
Fielded applications of data mining and machine learning
The difference between machine learning and statistics in data mining
Information and examples on data mining and ethics
Data acquisition and integration techniques
What is a data rollup?
Calculating mode in data mining projects
Using data merging and concatenation techniques to integrate data
SAS provides several options for merging and concatenating tables together using DATA step commands. However, we could also use SQL queries, through PROC SQL, to perform the same operations. In general, SAS DATA step options are more efficient in merging datasets than PROC SQL is. However, DATA step merging may require sorting of the datasets before merging them, which could be a slow process for large datasets. On the other hand, the performance of SQL queries can be enhanced significantly by creating indexes on the key variables used in merging.
Because of the requirement that the mining view have a unique record per category of key variable, most merging operations required to integrate different pieces of the mining view are of the type called match-merge with nonmatched observations. We demonstrate this type of merging with a simple example.
We start with two datasets, Left and Right, as shown in Table 2.6 .
The two tables can be joined using the MERGE – BY commands within a DATA step operation as follows:
INPUT ID Age Status $;
1 30 Gold
2 20 .
4 40 Gold
5 50 Silver
DATA Right; INPUT ID Balance Status $;
2 3000 Gold
4 4000 Silver
Table 2.6 Two Sample Tables: Left and Right
Table 2.7 Result of Merging: Dataset Both
MERGE Left Right;
PROC PRINT DATA = Both;
The result of the merging is the dataset Both given in Table 2.7, which shows that the MERGE-BY commands did merge the two datasets as needed using ID as the key variable. We also notice that the common file Status was overwritten by values from the Right dataset. Therefore, we have to be careful about this possible side effect. In most practical cases, common fields should have identical values. In our case, where the variable represented some customer designation status (Gold or Silver), the customer should have had the same status in different datasets. Therefore, checking these status values should be one of the data integrity tests to be performed before performing the merging.
Printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2009. Data Mining: Know It All by Chakrabarti et all. For more information about this title and other similar books, please visit www.elsevierdirect.com.
Merging datasets using this technique is very efficient. It can be used with more than two datasets as long as all the datasets in the MERGE statement have the common variable used in the BY statement. The only possible difficulty is that SAS requires that all the datasets be sorted by the BY variable. Sorting very large datasets can sometimes be slow.
You have probably realized by now that writing a general macro to merge a list of datasets using an ID variable is a simple task. Assuming that all the datasets have been sorted using ID before attempting to merge them, the macro would simply be given as follows:
%macro MergeDS(List, IDVar, ALL);
DATA & ALL;
MERGE & List; by
Finally, calling this macro to merge the two datasets in Table 2.6 would simply be as follows:
%let List = Left Right;
%let IDVar = ID;
%let ALL = Both;
%MergeDS( & List, & IDVar, & ALL);
Concatenation is used to attach the contents of one dataset to the end of another dataset without duplicating the common fields. Fields unique to one of the two files would be filled with missing values. Concatenating datasets in this fashion does not check on the uniqueness of the ID variable. However, if the data acquisition and rollup procedures were correctly performed, such a problem should not exist.
Performing concatenation in SAS is straightforward. We list the datasets to be concatenated in a SET statement within the destination dataset. This is illustrated in the following example.
Start with two datasets TOP and BOTTOM, as shown in Tables 2.8 and 2.9.
We then use the following code to implement the concatenation of the two datasets into a new dataset:
input ID Age Status $;
1 30 Gold
2 20 .
3 30 Silver
4 40 Gold
5 50 Silver
Table 2.8 Table: Top
Table 2.9 Table: Bottom
input ID Balance Status $;
6 6000 Gold
7 7000 Silver
SET TOP BOTTOM;
SET TOP BOTTOM;
The resulting dataset is shown in Table 2.10 .
More on data mining:
- Download a PDF of this chapter for free: "Data Acquisition and Integration"
- Read other excerpts from data management books in the Chapter Download Library.
As in the case of merging datasets, we may include a list of several datasets in the SET statement to concatenate. The resulting dataset will contain all the records of the contributing datasets in the same order in which they appear in the SET statement.
Table 2.10 Table: Both
The preceding process can be packed into the following macro:
%macro ConcatDS(List, ALL);
DATA & ALL;
SET & List;
To use this macro to achieve the same result as in the previous example, we use the following calling code:
%let List = TOP BOTTOM;
%let ALL = BOTH;
%ConcatDS( & List, & ALL);