Every one of my clients suffers terribly at the hands of data's evil twin sisters: inconsistent definitions and duplicates. And once the evil twins latch on to an organization, it's hard to shake them. Like a two-headed anaconda, the sisters encircle an organization and squeeze slowly and relentlessly until the lights go out.
Most of my clients recognize the value of data -- or rather, the pain of low-quality and inconsistent data. Corporate executives learn the hard way that without reliable BI and analytics data, they're running the business blind. They can't gauge customer interest in a new product in order to modify designs or marketing campaigns; they can't pinpoint manufacturing or supply chain problems to avoid cost overruns and delays; they can't accurately forecast sales to optimize inventory and distribution management processes.
Business execs also discover that bad data leaves them flat-footed. They can't respond nimbly to changing business conditions because they don't realize that the conditions are changing. The market continually catches them by surprise, and their carefully planned business strategies become out of date overnight. As a result, they stand by helplessly while fleet-footed competitors sprint past, luring away hard-won customers.
The problem is that too many organizations haven't figured out how to corral the twin sisters. Often, executives blame the IT department, erroneously thinking that managing data quality is a technical problem. (It isn't!) Then they call in me -- or another consultant -- to exorcise the evil twins so they can run their companies with 20/20 vision and sprinter's speed.
Different views of what data is
The first sister has a pervasive effect, because she spawns inconsistent data everywhere. In every organization, I hear the same refrain: "We get different answers to the same questions because people define terms differently." Ironically, the most commonly used terms tend to be the most poorly defined. I once heard a colleague say, "The most dangerous question you can ask a client is, 'What's your definition of customer?'" Executives from different departments often become quite heated debating the issue. To marketing, a customer is someone who responds to a campaign; to sales, it's someone who has signed a purchase order; to finance, it's someone who has paid a bill. Each is right, but collectively they're all wrong.
The solution to data inconsistency is obvious: a corporate data dictionary that spells out in plain English the definitions of commonly used terms and metrics. But creating one is hard. Gaining consensus on definitions is fraught with politics -- people fight tooth and nail to ensure that their definitions prevail in the corporate catechism.
To overcome the politics, the CEO needs to appoint a cross-functional committee of subject matter experts to prioritize terms and propose definitions for each. The executive team then needs to review and refine the committee's definitions and establish corporate standards. Often, this requires a lot of additional discussion and arm wrestling until the executives reach a consensus -- or more likely, a truce.
Typically, they agree to disagree. They create a corporate definition for each term and then aggregate departmental data so it conforms to the standard definitions, while still maintaining distinct local definitions. That way, the organization gets a single definition and each department preserves its view of the world. The approach works only as long as everyone uses and adheres to the data dictionary.
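To make the "agree to disagree" arrangement concrete, here is a minimal sketch of a data dictionary that holds one corporate definition of customer alongside the three departmental ones described earlier. The field names, the sample records and the choice of the sales definition as the corporate standard are all assumptions made for illustration; a real dictionary would hold whatever definitions the executive team settles on.

```python
from typing import Callable, Dict, List

Person = Dict[str, bool]

# Each entry pairs a view with its rule for who counts as a "customer".
DEFINITIONS: Dict[str, Callable[[Person], bool]] = {
    "marketing": lambda p: p["responded_to_campaign"],
    "sales": lambda p: p["signed_purchase_order"],
    "finance": lambda p: p["paid_bill"],
    # Hypothetical corporate standard; here it simply mirrors the sales rule.
    "corporate": lambda p: p["signed_purchase_order"],
}


def count_customers(people: List[Person], view: str) -> int:
    """Count the people who qualify as customers under the given view."""
    return sum(1 for p in people if DEFINITIONS[view](p))


# Made-up sample data: four people at different stages of the relationship.
people: List[Person] = [
    {"responded_to_campaign": True, "signed_purchase_order": True, "paid_bill": True},
    {"responded_to_campaign": True, "signed_purchase_order": True, "paid_bill": False},
    {"responded_to_campaign": True, "signed_purchase_order": False, "paid_bill": False},
    {"responded_to_campaign": False, "signed_purchase_order": False, "paid_bill": False},
]

counts = {view: count_customers(people, view) for view in DEFINITIONS}
```

The same four records produce a different customer count for each department, which is exactly the "different answers to the same questions" problem. The corporate entry gives reports a shared number without taking away the local views.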
Data duplicates spread defects
The second sister is more pernicious than the first. She works surreptitiously behind the scenes to undermine data values. This sister liberally sprinkles defects into database tables and fields using a variety of means: data entry and programming errors, system migration mishaps, flawed legacy system rewrites and just plain data obsolescence.
For example, in a customer database, 5% of the records deteriorate in quality each month due to death, marriage, divorce and change of address. Worse, an even higher percentage of customer records spawn inconsistent duplicates, largely because most organizations house customer data in multiple databases supporting different applications that capture different attributes of a customer at different times for different reasons. Then they face the perplexing problem of trying to figure out whether "Joe Daley, 51, of 1 Prescott Lane" is the same as "J. Daley, 53, of 10 Presque Lane" and "Joseph Dailey, 49, of 1 Prescot Ln."
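One common way to approach the "is this the same Joe?" question is fuzzy matching: score how similar two records are and flag close pairs for review. The sketch below uses Python's standard difflib module. The equal weighting of name and address, the 0.7 threshold, the five-year age tolerance and the unrelated "Maria Gonzalez" record are assumptions chosen for illustration, not recommended settings.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Customer(NamedTuple):
    name: str
    age: int
    address: str


def similarity(a: str, b: str) -> float:
    """Return a 0-to-1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(x: Customer, y: Customer) -> float:
    """Average name and address similarity (equal weights are an assumption)."""
    return (similarity(x.name, y.name) + similarity(x.address, y.address)) / 2


def possible_duplicate(x: Customer, y: Customer,
                       threshold: float = 0.7, max_age_gap: int = 5) -> bool:
    """Flag a pair for review when ages are close and the text fields look alike."""
    return abs(x.age - y.age) <= max_age_gap and match_score(x, y) >= threshold


joe = Customer("Joe Daley", 51, "1 Prescott Lane")
j = Customer("J. Daley", 53, "10 Presque Lane")
joseph = Customer("Joseph Dailey", 49, "1 Prescot Ln.")
maria = Customer("Maria Gonzalez", 34, "22 Harbor Street")  # hypothetical unrelated record

for other in (j, joseph, maria):
    print(other.name, round(match_score(joe, other), 2), possible_duplicate(joe, other))
```

A score like this only nominates candidates. Whether two records really describe the same person is still a judgment that needs business rules, and often a human reviewer.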
Keeping data harmonized across applications and systems is difficult. Ideally, companies store master data in one application and system only; that way, it never gets replicated or out of sync with itself. But most organizations spawn systems and applications like wildflowers after a spring rain. The only reliable way to harmonize customer records and other master data in such environments is to apply master data management (MDM) processes.
An MDM program is designed to prevent inconsistent data duplication. Typically, a hub system is used to store "golden records" of data. The MDM hub collects new data from transaction systems and runs it through a matching algorithm to determine whether corresponding records already exist. If they do, the hub updates them accordingly and makes sure the data is consistent; if they don't, it creates new records. It then makes the changes available to all subscribing applications in batch mode or real time.
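The hub's match-then-update-or-create cycle can be sketched in a few lines. This is a toy model, not how any particular MDM product works: the MdmHub class, its method names, the exact match on a normalized name and the "latest non-empty value wins" merge rule are all assumptions. Real hubs typically use fuzzy matching and far richer survivorship rules.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(value.lower().split())


class MdmHub:
    """Toy hub that keeps one golden record per matched customer."""

    def __init__(self) -> None:
        self.golden: Dict[int, Record] = {}
        self.subscribers: List[Callable[[int, Record], None]] = []
        self._next_id = 1

    def subscribe(self, callback: Callable[[int, Record], None]) -> None:
        """Register an application that wants to hear about changes."""
        self.subscribers.append(callback)

    def _find_match(self, incoming: Record) -> Optional[int]:
        # Simplistic matching rule (an assumption): exact match on normalized name.
        key = normalize(incoming["name"])
        for golden_id, record in self.golden.items():
            if normalize(record["name"]) == key:
                return golden_id
        return None

    def ingest(self, incoming: Record) -> int:
        """Match an incoming record, then update the golden record or create one."""
        golden_id = self._find_match(incoming)
        if golden_id is None:
            golden_id = self._next_id
            self._next_id += 1
            self.golden[golden_id] = dict(incoming)
        else:
            # Merge rule (an assumption): the latest non-empty value wins,
            # but the golden record keeps its original spelling of the name.
            for field, value in incoming.items():
                if value and field != "name":
                    self.golden[golden_id][field] = value
        # Publish the change to every subscribing application.
        for notify in self.subscribers:
            notify(golden_id, dict(self.golden[golden_id]))
        return golden_id


hub = MdmHub()
changes = []
hub.subscribe(lambda golden_id, record: changes.append((golden_id, record)))

first = hub.ingest({"name": "Joe Daley", "address": "1 Prescott Lane"})
second = hub.ingest({"name": "joe  daley", "phone": "555-0100"})  # made-up phone number
```

Both transactions land on the same golden record, which now carries the address from the first system and the phone number from the second. The subscriber receives a notification for each change, standing in for the real-time feed. A batch feed would collect the changes and deliver them on a schedule instead.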
Caging data's evil twin sisters isn't easy. But they must be confronted, apprehended, handcuffed and sent up the river in order to resolve data quality issues and lay a solid foundation for reporting and analytics. Your business depends on it: Without clean, consistent and harmonized data, companies cannot compete effectively in today's economy.
About the author:
Wayne Eckerson is principal consultant at Eckerson Group, a consulting firm that helps business leaders use data and technology to drive better insights and actions. His team provides information and advice on business intelligence, analytics, performance management, data governance, data warehousing and big data. Email him at email@example.com.