This article originally appeared on the BeyeNETWORK.
The objective of master data integration projects such as master data management (MDM), customer data integration (CDI) or product information management (PIM) is to confirm the existence of commonly used data objects across the existing application architecture and determine whether those data objects can be consolidated into a single, uniform object model. This object model, materialized as a core master data repository coupled with the appropriate services layer, ultimately replaces the replicated copies of similar representations within each application. In preparation for the consolidation phase, then, the assessment phase of a master data management program involves collecting metadata associated with the data objects used by the different applications; reviewing their similarities and differences; identifying organizational requirements not being met due to the absence of a consistent, consolidated repository; and developing the core master object and information models.
As is common in many information technology projects, though, there are two components to the information model: the structure and the content. Developing a data model for core master objects is challenging, especially when choosing the architectural style to be used for the master repository. But the perennial bugaboo is the content: it is the process of data aggregation and consolidation that poses the most significant hurdle.
The major issue is a process referred to as identity resolution: the ability to determine that two or more data representations can be resolved into one representation of a unique object. Note that I am not limiting the discussion to people’s names or addresses. Although the bulk of the data (and consequently, the challenge) involves person or business names and addresses, there is a growing need to resolve records associated with other kinds of data, such as product names, product codes, object descriptions, reference data, etc.
For a given data population, identity resolution can be viewed as a two-stage process. The first stage is discovery, which combines data profiling activities with manual review of the data. This stage typically yields simple probabilistic models that feed into the second stage, which is similarity scoring and matching for the purpose of record linkage. The model is then applied to a much larger population of records, often taken from different sources, to link records and, presumably, to establish automatically (within predefined bounds) that some sets of records refer to the same entity.
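To make the second stage concrete, here is a minimal sketch of similarity scoring between two records. It is an illustration under stated assumptions, not a description of any particular product: the field names and weights are hypothetical stand-ins for what a discovery stage might produce, and Python's standard `difflib.SequenceMatcher` is used as a simple substitute for the more specialized string comparators found in commercial tools.

```python
from difflib import SequenceMatcher

# Hypothetical field weights; in practice these would come out of the
# discovery stage (profiling plus manual review), not be hard-coded.
FIELD_WEIGHTS = {"name": 0.5, "address": 0.3, "birth_date": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity for two field values.

    Case and surrounding whitespace are ignored. A missing value on
    either side contributes no evidence, so it scores 0.0.
    """
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def record_similarity(r1: dict, r2: dict) -> float:
    """Return the weighted average of the per-field similarities."""
    total_weight = sum(FIELD_WEIGHTS.values())
    weighted = sum(
        weight * field_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, weight in FIELD_WEIGHTS.items()
    )
    return weighted / total_weight
```

Applied to a pair such as "John Smith, 12 Main St" and "Jon Smith, 12 Main Street" with the same birth date, this kind of score lands well above that of two unrelated records, which is all the linkage step needs in order to rank candidate pairs.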
Usually, there are some bounds to what can be deemed an automatic match, and these bounds are not just dependent on the quantification of similarity, but must be defined based on the application. For example, there is a big difference between trying to determine if the same person is being mailed two catalogs instead of one and determining if the individual boarding the plane is on the terrorist list.
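The dependence of the bounds on the application can be sketched as follows. The policy names and threshold values here are purely hypothetical, chosen only to show that the same similarity score can lead to different outcomes: a mailing application might accept automatic matches fairly readily, while a screening application might route far more cases to a human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchPolicy:
    """Bounds for one application (all values are illustrative)."""
    auto_match: float  # scores at or above this are linked automatically
    review: float      # scores at or above this (but below auto_match) go to manual review


# Hypothetical policies for the two situations described above.
CATALOG_MAILING = MatchPolicy(auto_match=0.85, review=0.70)
WATCHLIST_SCREENING = MatchPolicy(auto_match=0.99, review=0.60)


def classify(score: float, policy: MatchPolicy) -> str:
    """Map a similarity score to 'match', 'review', or 'non-match'."""
    if score >= policy.auto_match:
        return "match"
    if score >= policy.review:
        return "review"
    return "non-match"
```

With these assumed values, a score of 0.90 is an automatic match for the catalog mailing but only a candidate for review in the screening case, and a score of 0.65 is dismissed by the former yet still reviewed by the latter.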
Does all this sound familiar? If you are familiar with traditional data quality technology, identity resolution will appear similar, if not identical, to the processes used to determine duplicates for the purpose of duplicate elimination, or for value-added processes such as householding. I suspect that as these techniques have been recognized for adding value beyond standard name and address cleansing, there has been an opportunity to rebrand them and penetrate new markets such as regulatory compliance (e.g., “Know Your Customer”) and the previously mentioned MDM/CDI/PIM markets.
The use of identity resolution techniques may pose some interesting challenges within certain industries and geographies, most particularly when combined with the need to comply with privacy restrictions. Consider this situation: an organization has deployed a master data repository for individual information, and all enterprise-wide personal data for each individual has been consolidated into a single, unique record and assigned a unique reference identifier. To retrieve an individual’s record, one must use the identifier as a key, but any specific individual may not recall his/her identifier. This indicates a requirement for a client application to invoke a “location” process that, given some set of identifying information, applies an identity resolution process to find the most likely match within the repository.
The issue arises when the identifying information is not sufficient to uniquely find that person within the system, and the location process must present some suggested matches to the client who is searching for the individual. Yet the simple task of presenting information back to the client application essentially reveals identifying information to which that client may not be allowed access under the privacy restrictions. This dilemma is not a technical one, but rather a question of policy, which strongly suggests (in yet another way) that any master data consolidation and management program needs a set of governance policies that can be deployed.
Consider the following: How much identifying information needs to be stored in order to ensure that any individual’s data can be resolved into a master record? What is the minimum amount that needs to be requested in order to uniquely locate an individual with a high degree of confidence? Lastly, what is the process for iteratively requesting additional data values to refine the similarity matching until it yields a unique identification (or determines that the individual does not even have a record in the master repository)? The emergence of identity resolution as a standalone technical component will necessitate further exploration of the policies and procedures that must accompany it.
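One possible shape for such an iterative location process is sketched below. Everything in it is an assumption made for illustration: the attribute names, the order in which additional values are requested, and the minimum number of attributes required before a match is confirmed are hypothetical policy choices, and normalized exact comparison stands in for real similarity matching to keep the example short. The point of the sketch is that the response never carries candidate records back to the client; it returns only an identifier, a request for one more attribute, or a status.

```python
# Hypothetical order in which additional attributes are requested.
REQUEST_ORDER = ["last_name", "postal_code", "birth_date", "first_name"]


def normalize(value: str) -> str:
    """Lowercase and trim a value before comparison."""
    return value.strip().lower()


def locate(repository: dict, supplied: dict, min_attributes: int = 2) -> dict:
    """Try to locate one master record without exposing candidate data.

    `repository` maps a master identifier to a dict of attributes.
    `supplied` holds the attribute values the client has provided so far.
    """
    # Keep only records that agree with every supplied attribute.
    candidates = [
        record_id
        for record_id, record in repository.items()
        if all(
            normalize(record.get(attr, "")) == normalize(value)
            for attr, value in supplied.items()
        )
    ]
    if not candidates:
        return {"status": "not_found"}
    # Confirm a match only when it is unique AND enough evidence was supplied.
    if len(candidates) == 1 and len(supplied) >= min_attributes:
        return {"status": "found", "identifier": candidates[0]}
    # Otherwise ask for the next attribute instead of listing candidates.
    missing = [attr for attr in REQUEST_ORDER if attr not in supplied]
    if missing:
        return {"status": "need_more", "request": missing[0]}
    return {"status": "ambiguous"}
```

A client would call `locate` repeatedly, adding the requested attribute each time, until the status is `found`, `not_found`, or `ambiguous`. What to do in the ambiguous case, and how many attributes constitute sufficient evidence, are precisely the governance questions raised above rather than anything the code can decide.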
Author's note: For the audio version of this article, please visit: http://www.knowledge-integrity.com/podcasts.htm.