This article originally appeared on the BeyeNETWORK.
A conundrum introduced by the growing desire for an organizational “360-degree view” of customers is the need to consolidate many representations of a single individual into one. The difficulty lies in the fact that although we need to model, and therefore correlate, a single data instance to the single individual being modeled, those individuals may themselves take on many “identities” in real life.
As an example, consider the many different ways people refer to the 43rd President of the United States: “Bush,” “George Bush,” “George W. Bush,” “Dubya” and “43” are all commonly used names. Each of these is distinct, and in a set of databases all these names could exist in separate records presumed to represent different people, already creating some confusion. Yet, let’s see what happens when we add to the mix some of the names used to refer to the 41st President of the United States: “Bush,” “George Bush” and “George H. W. Bush.”
All of a sudden there are representations of a different person that are nominally indistinguishable from those of the first! Consequently, even presuming our information were valid and free of any flaws, it might be a challenge to determine whether two (or more) names in records refer to the same person. Drop the presumption of flawless data, and the problem seems even more intractable.
This example points out the two basic challenges associated with a customer data integration (CDI) application:
- The ability to determine a linkage between two records, establishing that they refer to the same entity (i.e., a “match”); and
- The ability to determine that two data records are not connected, and that they do not refer to the same entity (i.e., a “non-match”).
Boiling the problem down to matching vs. non-matching gives us the freedom to realize that we don’t have to rely solely on names as our matching/non-matching criteria. The basic idea behind record linkage is that each record is assumed to be a representation of some real-world entity. By assuming that each column supplies some specific attribute associated with that entity, if we can find enough attribute matches of significance, we will have enough evidence to presume that two records refer to the same entity.
Usually we have lots of data that can contribute to making this decision, including explicit knowledge, inferred knowledge and embedded knowledge. Beyond that, the concept of linkage lets us exploit connectivity as part of our matching criteria. Let’s take a closer look at these approaches:
Explicit knowledge includes entity attribution that is clearly presented in the data instance. For example, consider records from two different database tables that contain both name data and contact information, such as telephone number and mailing address. If the two records match exactly in name, address and telephone number, then any system would mark those two records as a match.
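The exact-match rule described above can be sketched in a few lines. This is a minimal illustration with made-up records and field names, not a production matcher: a link is declared only when every linking field agrees exactly.

```python
# Hypothetical records drawn from two different tables; the field
# names and values here are illustrative, not from any real schema.
rec_a = {"name": "GEORGE W BUSH", "phone": "555-0143",
         "address": "1600 PENNSYLVANIA AVE"}
rec_b = {"name": "GEORGE W BUSH", "phone": "555-0143",
         "address": "1600 PENNSYLVANIA AVE"}

def exact_match(a, b, fields=("name", "phone", "address")):
    """Declare a match only when every linking field agrees exactly."""
    return all(a.get(f) == b.get(f) for f in fields)

print(exact_match(rec_a, rec_b))  # True: all three linking fields agree
```

Any system would mark these two records as a match; the interesting cases, discussed next, are the ones where agreement is only partial.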
The only significant challenge in exploiting explicit knowledge for customer data integration lies in identifying linking fields across data sets. If we look at many separate applications developed across vertical organizations, we might find that the same data element concept is named and implemented in different ways. For example, a customer number may be referred to as “CUSTNUM” in one table and represented as a numeric value, while it may be called “CUST_ID” in a different table and be represented as an alphanumeric value. This kind of information should be captured within the enterprise metadata repository, which not only will capture semantic information, but also can add hierarchy and taxonomy knowledge to help in CDI.
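A metadata repository’s field mappings can be applied before matching ever begins. The sketch below is a toy version of that idea, with a hypothetical mapping table and source names; it renames each source’s linking field to a shared canonical name and coerces values to strings so a numeric CUSTNUM and an alphanumeric CUST_ID become comparable.

```python
# Hypothetical mapping from each source's field name to a shared
# canonical name, as might be recorded in a metadata repository.
SCHEMA_MAP = {
    "billing":  {"CUSTNUM": "customer_id"},
    "shipping": {"CUST_ID": "customer_id"},
}

def normalize(source, record):
    """Rename linking fields per the metadata map and coerce every
    value to a trimmed string so cross-type comparison is possible."""
    out = {}
    for field, value in record.items():
        canonical = SCHEMA_MAP.get(source, {}).get(field, field)
        out[canonical] = str(value).strip()
    return out

a = normalize("billing",  {"CUSTNUM": 10442, "NAME": "J FRANKLIN"})
b = normalize("shipping", {"CUST_ID": "10442", "NAME": "JOHN FRANKLIN"})
print(a["customer_id"] == b["customer_id"])  # True: the IDs now align
```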
However, what do we do if the names don’t completely match, but are similar, and the address and telephone numbers match exactly? The focus here becomes a question of how similarity is measured. Even though a human can stare at two names and immediately make a link, computer programs are not necessarily so savvy, and may rely on more complex algorithms to compute the degree of similarity and then combine that with other inputs to derive the inference that solidifies the link. In our case, the closeness of the names paired with the exactness of the corresponding data is enough to make the connection.
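That combination of evidence, close-but-inexact names paired with exactly matching contact data, can be sketched with Python’s standard-library string matcher. This is a simplified stand-in: real linkage engines typically use purpose-built measures such as Jaro-Winkler or edit distance, and the 0.6 threshold here is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity in [0, 1] via the stdlib matcher;
    production systems usually use Jaro-Winkler or edit distance."""
    return SequenceMatcher(None, a, b).ratio()

def link(rec_a, rec_b, name_threshold=0.6):
    """Accept a link when the names are close enough AND the remaining
    contact fields agree exactly -- combined evidence, not names alone."""
    exact = (rec_a["phone"] == rec_b["phone"]
             and rec_a["address"] == rec_b["address"])
    return exact and similarity(rec_a["name"], rec_b["name"]) >= name_threshold

a = {"name": "GEORGE W. BUSH", "phone": "555-0143", "address": "1600 PENN AVE"}
b = {"name": "GEORGE BUSH",    "phone": "555-0143", "address": "1600 PENN AVE"}
print(link(a, b))  # True: similar names plus exact contact data
```

Neither signal alone would have sufficed; it is the exactness of the corresponding contact data that lets a merely similar name solidify the link.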
There are other perspectives to be reviewed with respect to making inferences about individuals. For example, information context can be made into an explicit attribute. For example, the existence of a name “JOHN FRANKLIN” in a mailing list comprised of Dental Professionals helps in establishing a link to “J FRANKLIN, DDS” in another table. A different perspective comes from historical data, in which different linking field values change in isolation over time. For example, consider two records in a customer contact database, where both agree on name and customer number, but where the contact information changes at some point in time, corresponding to a time when the customer moved from one address to another. Establishing the change of address at some fixed point in time allows the linkage of records from different data sets with different addresses.
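The dental-list example can be made concrete with a toy abbreviation check: two names are compatible when the surnames agree and the remaining tokens agree on their initials, with the list’s professional context (and the DDS credential) supplying the corroborating evidence. The token rules here are deliberately naive assumptions, not a real name parser.

```python
def name_tokens(name):
    """Strip a trailing credential such as DDS and split into tokens."""
    return [t for t in name.replace(",", " ").split() if t != "DDS"]

def compatible_names(full, abbrev):
    """True when one name could abbreviate the other: same surname,
    and the remaining tokens agree on their first letters."""
    a, b = name_tokens(full), name_tokens(abbrev)
    if a[-1] != b[-1]:          # surnames must agree
        return False
    return all(x[0] == y[0] for x, y in zip(a, b))

# The mailing list's context (dental professionals) is the extra
# explicit attribute that tips an otherwise ambiguous name match.
in_dental_list = True
print(compatible_names("JOHN FRANKLIN", "J FRANKLIN, DDS") and in_dental_list)
```

Note the obvious weakness: an initial also matches “JANE,” which is exactly why context must be combined with, rather than substituted for, the name comparison.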
The significant challenge in exploiting inferred knowledge is the integration of available technology, meta-knowledge about the data sets to be aggregated and the subject matter expertise needed to effectively make decisions about inferences. Good technology can help in proposing matches, but this must be accompanied by good decision-making to be effective.
As a way of exposing our own humanity when trying to insert information into a predefined data model, we frequently find times when an application impedes someone’s ability to get the necessary data into the system. In these cases, people often will put data values into places where they don’t really belong. I have seen many examples of this, such as inserting a fax number into a field for foreign country names, or putting people’s names in street address fields. In the name of information purity, companies will frequently cleanse this data out of the table to ensure the data’s compliance with stated expectations.
Many of these occurrences are due to system inflexibility, while others may be relics of purpose-driven model design. A frequent example is the melding of party data with its corresponding contact information—the standard model consisting of Name, Address Line 1, Address Line 2, City, State, Zip and Phone Number. In the numerous variations, we often see parts of names spilling over from the Name field into Address Line 1, or P.O. Box numbers strewn across various fields, or names called out for attention or delivery in either address field.
While many professionals view these instances as nuisances, I prefer to see these as opportunities, since there is often a lot of knowledge embedded within those values, especially when unexpected values begin to appear with some frequency. As an example, in a recent review of an employer address database, our consultants were able to establish a link between three different employers by showing that each of the employers had address records that made reference to the same three individual contacts, all of whose names were embedded within the Address Line 1 field.
The challenge in exploiting embedded knowledge is threefold: being able to recognize that embedded knowledge is present; being able to extract those values from the fields in which they are embedded; and understanding how to build a “connectivity model” in which the embedded information can be made explicit for use in linkage.
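All three steps can be seen in a toy version of the employer example above. The rows, markers and names are fabricated for illustration, and the extraction rule is deliberately naive (a real system would use parsing rules or a name dictionary), but the shape is the same: recognize the embedded contact, extract it, then build a connectivity model mapping each contact to the employers that reference it.

```python
from collections import defaultdict

# Hypothetical employer address rows where contact names leaked into
# Address Line 1 -- the embedded knowledge we want to surface.
rows = [
    {"employer": "ACME CORP",    "addr1": "ATTN MARY POOLE, 12 ELM ST"},
    {"employer": "ACME HOLDING", "addr1": "C/O MARY POOLE"},
    {"employer": "AJAX LLC",     "addr1": "ATTN MARY POOLE SUITE 4"},
]

def extract_contact(addr1):
    """Toy extraction: pull the name following an ATTN/C-O marker.
    Real extraction would use parsing rules or a name dictionary."""
    for marker in ("ATTN ", "C/O "):
        if marker in addr1:
            tail = addr1.split(marker, 1)[1]
            # keep the first two tokens as a first/last name
            return " ".join(tail.replace(",", " ").split()[:2])
    return None

# Connectivity model: contact name -> employers that reference it.
links = defaultdict(set)
for row in rows:
    contact = extract_contact(row["addr1"])
    if contact:
        links[contact].add(row["employer"])

print(links["MARY POOLE"])  # the shared contact links all three employers
```

Any contact whose set contains more than one employer is exactly the kind of cross-record link the embedded names made possible.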
In this article I have only just scratched the surface of the multitude of issues associated with customer data integration. We only discussed three (of many) aspects of the application, and all of them merit a more detailed treatment. But hopefully, even the limited discussion here will provide some insight into the many ways that customer data can be leveraged to provide that desirable 360-degree view.