This article originally appeared on the BeyeNETWORK.
This article will address a basic precept for multi-data set integration, some of its associated issues and some ideas for optimizing the process.
First of all, the goal of an entity identification and integration project is to determine whether a pair of entity representations taken from two data sets refers to the same thing. Depending on the application, the rules for determining similarity may be stricter or looser, but even in the strictest sense, there must be some identifying key that is shared by both records in the pair. So let’s examine that case:
Presume we have two data sets, each of which contains records associated with individuals. Let’s also presume that each record contains a person’s last name, first name and an identifier assigned by the same organization to uniquely identify each individual. That identifier may be assigned externally, such as (in the United States) a social security number or it may be assigned internally, such as a customer number or membership number. In this case, the simplest way to determine that two records represent the same individual is that the unique identifier in the first record is exactly the same as the unique identifier in the second record. Of course, this also presumes a high level of data quality, in that there are no numbers mistakenly associated with the wrong person. No doubt about it, this is the most straightforward approach to data integration.
So how does one implement this approach? The most naïve approach to this kind of data match is to take each record in the first data set, extract its unique identifier, then iteratively take each record in the second data set, extract its unique identifier, and compare the two. If the identifiers match, then we conclude that the two records refer to the same individual. In terms of performance, each of the m records in the first data set is compared against each of the n records in the second data set. The time it takes to complete a match is therefore proportional to the product of the sizes of the two data sets, or m × n.
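As an illustration, the nested-loop comparison can be sketched in a few lines of Python; the record layout and the "id" field name here are hypothetical stand-ins for whatever unique identifier the data sets actually carry:

```python
# Naive O(m * n) match: compare every record in the first data set
# against every record in the second. "id" is an illustrative field name
# for the shared unique identifier.
def naive_match(first, second):
    matches = []
    for rec_a in first:            # m iterations
        for rec_b in second:       # n iterations per record in first
            if rec_a["id"] == rec_b["id"]:
                matches.append((rec_a, rec_b))
    return matches
```

For each of the m records we do n comparisons, which is exactly the m × n cost described above.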
Well, you say, we would not implement it that way, because each of the data sets is stored in a table in a relational database management system (or even VSAM files, same idea), indexed by the unique identifier. Instead, we would take each record in the first table, extract its unique identifier and look that identifier up in the second table. Performance-wise, this is better: we can assume that the indexing scheme is B-tree based, so the search time for each lookup is proportional to the logarithm of the number of records in the second table (ln n), and the overall time is then proportional to m × ln n. We’ll call these the database approaches.
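The indexed-lookup behavior can be approximated outside a database with a sorted list and binary search, which, like a B-tree probe, costs time logarithmic in n per lookup. This is a sketch, not a database implementation; the function name and the identifier lists are illustrative:

```python
import bisect

# Emulates the m * ln n database approach: keep the second data set's
# identifiers sorted (as a B-tree index would) and binary-search each
# lookup key from the first data set.
def indexed_match(first_ids, second_ids):
    index = sorted(second_ids)     # stand-in for the B-tree index
    matches = []
    for key in first_ids:          # m lookups, each O(ln n)
        pos = bisect.bisect_left(index, key)
        if pos < len(index) and index[pos] == key:
            matches.append(key)
    return matches
```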
This is quite a bit better, in fact, but still can be bested by thinking a little more strategically with respect to performance. The first thing to note is that in both of our approaches, we have spent time accessing data that may be unnecessary, especially when the first record does not have a match in the target data set. The second thing to note is that we are really comparing one set of single values against another set of single values, which is more of a “mathematical” set intersection problem and less of a database one. To this end, let’s look at two different approaches that bail out on the database for the sake of implementing the set intersection: hash tables and bit vectors.
A hash table is a data structure that captures the relationship between an object’s key value and the object itself, mapping each key to an index into an array by applying a function (the hash function) to the key. In a good hash table, the amount of time it takes to access an object is constant: calculate the function to determine the array index, then access the array entry. At the risk of glossing over some details (e.g., collision handling and other algorithmic tweaks that I leave as an exercise to the reader; consult Knuth’s algorithms books), the cost of inserting objects into a hash table is proportional to the number of objects, and the cost of looking an object up in the hash table is constant time. Therefore, for our matching process, the time for determining the set of matching identifiers between the two sets is proportional to the number of records in the first data set plus the number of records in the second data set, or m + n.
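A minimal sketch of the hash-table approach, here leaning on Python’s built-in set type (which is hash-based) rather than a hand-rolled table; the identifier lists are illustrative:

```python
# Hash-table set intersection in O(m + n): build a hash set from the
# second data set's identifiers (n constant-time inserts), then probe it
# once per identifier in the first (m constant-time lookups).
def hash_match(first_ids, second_ids):
    lookup = set(second_ids)   # Python sets are hash-based
    return [key for key in first_ids if key in lookup]
```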
Alternately, if the unique identifiers are drawn from a reasonably small, dense range (sparsely assigned identifiers such as social security numbers are a poor fit), we can use a different representation of a set for our purposes. A bit vector is an array of single bits, where each identifier value serves as an index to its associated bit. In other words, for unique identifier 120998, the 120,998th bit is set to 1 to indicate membership in the set. Again, a bit vector is a data structure that is described in most algorithm texts, and it is easily programmed. We initially set all bits in the bit vector to 0. For each identifier in the target set, we index into the bit vector and set the corresponding bit to 1. For each identifier in the lookup data set, we simply index into the bit vector and check whether the corresponding bit is set to 1; if so, we have a match; otherwise, we do not. As with the hash table, the time it takes to populate the bit vector is proportional to the number of records in the target data set, and the number of probes equals the number of records in the lookup data set, so the total is again proportional to m + n. We’ll call these the algorithmic approaches.
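The bit-vector variant might look like the following sketch, which assumes the identifiers are non-negative integers below some known bound max_id (an assumption not guaranteed by every identifier scheme):

```python
# Bit-vector set intersection in O(m + n), assuming integer identifiers
# in the range [0, max_id). One bit per possible identifier value.
def bitvector_match(first_ids, second_ids, max_id):
    bits = bytearray((max_id + 7) // 8)   # all bits initialized to 0
    for key in second_ids:                # mark target-set members
        bits[key >> 3] |= 1 << (key & 7)
    return [key for key in first_ids      # probe each lookup identifier
            if bits[key >> 3] & (1 << (key & 7))]
```

Note the space trade-off: the vector needs one bit per possible identifier, which is why a compact identifier range matters.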
So what is the big deal here, and why do we care what the performance calculation is? If your data sets are relatively small, you probably don’t care. But let’s say you are matching two data sets, each with 1,000,000 records. In the database approach (m × ln n), ln(1,000,000) is approximately 14, so the number of comparisons for the best-performing database approach is proportional to about 14,000,000. Using the algorithmic approaches, the amount of work is proportional to the sum of the sizes of the data sets, or about 2,000,000 comparisons, roughly 14 percent of the work; if the database approach takes seven hours, the algorithmic approach would finish in an hour.
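The back-of-the-envelope figures above are easy to check directly:

```python
import math

m = n = 1_000_000
db_work = m * math.log(n)   # indexed database approach: m * ln n, ~14,000,000
alg_work = m + n            # hash table or bit vector: m + n = 2,000,000
ratio = alg_work / db_work  # roughly 0.14, i.e., about 14 percent of the work
```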
Despite this performance increase, a development team might be more likely to favor the database approach over the algorithmic approach, for a few reasons. First of all, the individuals deploying these matches are more likely to be “data people,” who are less inclined to develop software that pulls data out of the database for analysis. Consequently, it is easier to implement a database approach, and without expertise in performance analysis, there isn’t anyone around to point out the opportunity for improvement. Conversely, “algorithmists” are often directed toward software projects and steered away from the data, in which case they are never integrated into the integration process.
In truth, it would be worthwhile for both sorts of developers to team up to explore creative ways to aggregate data into actionable knowledge. In future articles we will look at other approaches to the simple matching process as a precursor to more complex matching, the use of tools for integration and ways to capture the knowledge gained during the data integration process. If you have specific questions or comments about this topic, please don’t hesitate to send me an email at firstname.lastname@example.org.