Using data merging and concatenation techniques to integrate data

Learn two data integration techniques, data merging and concatenation, and see how to combine and merge data sets in this excerpt from the book Data Mining: Know it All.

Data Mining: Know it All
This chapter from Data Mining: Know it All, learn learn the difference between merging and concatenation and see an example of merging and concatenation during the data mining process.

Table of contents:

An introduction to data mining
Simple data mining examples and datasets
Fielded applications of data mining and machine learning
The difference between machine learning and statistics in data mining
Information and examples on data mining and ethics
Data acquisition and integration techniques
What is a data rollup?
Calculating mode in data mining projects
Using data merging and concatenation techniques to integrate data

2.7 Data Integration

The data necessary to compile the mining view usually comes from many different tables. The rollup and summarization operations described in the last two sections can be performed on the data coming from each of these data sources independently. Finally, we would be required to assemble all these segments in one mining view. The most used assembly operations are merging and concatenation . Merging is used to collect data for the same key variable (e.g., customer ID) from different sources. Concatenation is used to assemble different portions of the same data fields for different segments of the key variable. It is most useful when preparing the scoring view with a very large number of observations (many millions). In this case, it is more efficient to partition the data into smaller segments, prepare each segment, and finally concatenate them together.

2.7.1 Merging
SAS provides several options for merging and concatenating tables together using DATA step commands. However, we could also use SQL queries, through PROC SQL, to perform the same operations. In general, SAS DATA step options are more efficient in merging datasets than PROC SQL is. However, DATA step merging may require sorting of the datasets before merging them, which could be a slow process for large datasets. On the other hand, the performance of SQL queries can be enhanced significantly by creating indexes on the key variables used in merging.

Because of the requirement that the mining view have a unique record per category of key variable, most merging operations required to integrate different pieces of the mining view are of the type called match-merge with nonmatched observations. We demonstrate this type of merging with a simple example.

Copyright Info
Printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2009. Data Mining: Know It All by Chakrabarti et all. For more information about this title and other similar books, please visit www.elsevierdirect.com.

Example 2.1
We start with two datasets, Left and Right, as shown in Table 2.6 .

The two tables can be joined using the MERGE – BY commands within a DATA step operation as follows:

DATA Left;
  INPUT ID Age Status $;
  datalines;
  1 30 Gold
  2 20 .
  4 40 Gold
  5 50 Silver
 ;
RUN;

DATA Right; INPUT ID Balance Status $;
  datalines;
  2 3000 Gold
  4 4000 Silver
;
RUN;

Table 2.6 Two Sample Tables: Left and Right

ID
Table: Left
ID
Table: Right
Age Status Balance Status
1 30 Gold 2 3000 Gold
2 20 . 4 4000 Silver
4 40 Gold      
5 50 Silver      

Table 2.7 Result of Merging: Dataset Both

Obs ID Age Status Balance
1 1 30 Gold .
2 2 20 Gold 3000
3 4 40 Silver 4000
4 5 50 Silver .

DATA Both;
MERGE Left Right;
BY ID;
RUN;
PROC PRINT DATA = Both;
RUN;

The result of the merging is the dataset Both given in Table 2.7, which shows that the MERGE-BY commands did merge the two datasets as needed using ID as the key variable. We also notice that the common file Status was overwritten by values from the Right dataset. Therefore, we have to be careful about this possible side effect. In most practical cases, common fields should have identical values. In our case, where the variable represented some customer designation status (Gold or Silver), the customer should have had the same status in different datasets. Therefore, checking these status values should be one of the data integrity tests to be performed before performing the merging.

Merging datasets using this technique is very efficient. It can be used with more than two datasets as long as all the datasets in the MERGE statement have the common variable used in the BY statement. The only possible difficulty is that SAS requires that all the datasets be sorted by the BY variable. Sorting very large datasets can sometimes be slow.

You have probably realized by now that writing a general macro to merge a list of datasets using an ID variable is a simple task. Assuming that all the datasets have been sorted using ID before attempting to merge them, the macro would simply be given as follows:

%macro MergeDS(List, IDVar, ALL);
DATA & ALL;
  MERGE & List; by
 & IDVar;
run;
%mend;

Finally, calling this macro to merge the two datasets in Table 2.6 would simply be as follows:

%let List = Left Right;
%let IDVar = ID;
%let ALL = Both;
%MergeDS( & List, & IDVar, & ALL);

2.7.2 Concatenation
Concatenation is used to attach the contents of one dataset to the end of another dataset without duplicating the common fields. Fields unique to one of the two files would be filled with missing values. Concatenating datasets in this fashion does not check on the uniqueness of the ID variable. However, if the data acquisition and rollup procedures were correctly performed, such a problem should not exist.

Performing concatenation in SAS is straightforward. We list the datasets to be concatenated in a SET statement within the destination dataset. This is illustrated in the following example.

Example 2.2
Start with two datasets TOP and BOTTOM, as shown in Tables 2.8 and 2.9.

We then use the following code to implement the concatenation of the two datasets into a new dataset:

DATA TOP;
input ID Age Status $;
datalines;
1 30 Gold
2 20 .
3 30 Silver
4 40 Gold
5 50 Silver
;
run;

Table 2.8 Table: Top

Obs ID Age Status
1 1 30 Gold
2 2 20 .
3 3 30 Silver
4 4 40 Gold
5 5 50 Silver

Table 2.9 Table: Bottom

Obs ID Balance Status
1 6 6000 Gold
2 7 7000 Silver

DATA BOTTOM;
input ID Balance Status $;
  datalines;
 6 6000 Gold
 7 7000 Silver
  ;
run;
DATA BOTH;
  SET TOP BOTTOM;
run;
DATA BOTH;
  SET TOP BOTTOM;
run;

The resulting dataset is shown in Table 2.10 .

As in the case of merging datasets, we may include a list of several datasets in the SET statement to concatenate. The resulting dataset will contain all the records of the contributing datasets in the same order in which they appear in the SET statement.

Table 2.10 Table: Both

Obs ID Age Status Balance
1 1 30 Gold .
2 2 20 . .
3 3 30 Silver .
4 4 40 Gold .
5 5 50 Silver .
6 6 . Gold 6000
7 7 . Silver 7000

The preceding process can be packed into the following macro:

%macro ConcatDS(List, ALL);
DATA & ALL;
  SET & List;
run;
%mend;

To use this macro to achieve the same result as in the previous example, we use the following calling code:

%let List = TOP BOTTOM;
%let ALL = BOTH;
%ConcatDS( & List, & ALL);

More on data mining:

  • Download a PDF of this chapter for free: "Data Acquisition and Integration"
  • Read other excerpts from data management books in the Chapter Download Library.
  • This was first published in May 2009

    Dig deeper on Business intelligence data mining

    Pro+

    Features

    Enjoy the benefits of Pro+ membership, learn more and join.

    0 comments

    Oldest 

    Forgot Password?

    No problem! Submit your e-mail address below. We'll send you an email containing your password.

    Your password has been sent to:

    -ADS BY GOOGLE

    SearchDataManagement

    SearchAWS

    SearchContentManagement

    SearchCRM

    SearchOracle

    SearchSAP

    SearchSQLServer

    Close