Business competitiveness and agility are increasingly dependent on decisions that are informed and fueled by business intelligence (BI), reporting and analytics. For example, in an emerging “age of the algorithm,” operational applications and processes are often enhanced as a result of business analytics. Meanwhile, power-user analysts explore various business scenarios by combining multiple large data sets, in many cases containing both structured and unstructured information.
As this dependence on BI grows, it should come as no surprise that business analytics users must be able to trust their decision-making processes, which in turn requires that trustworthy data be available to them.
Data quality is especially critical as data volumes and the number of data sources grow, but what is meant by “high-quality data”? Data management professionals typically define data quality in terms of “fitness for use,” but that concept rapidly breaks down when we consider the numerous ways that the same data sets are repurposed and ultimately reused.
Assessing BI data quality
From the analytics standpoint, data quality is best defined in terms of the degree to which data flaws impair the analytical results. Within an organization, that can be assessed using the following dimensions:
Completeness, which measures whether a data set contains the full number of records or instances that it should, as well as the degree to which each data instance has a full set of values for its mandatory data elements. Incomplete data can have a detrimental effect on analysis, particularly in the context of aggregations (such as sums and averages) that are skewed by missing data values.
Accuracy, for checking data values against their real-world counterparts, such as confirming that telephone numbers entered into a system match the actual numbers. A small number of inaccuracies in a large data set might not be statistically significant; but as with an incomplete data set, a larger number of inaccuracies will skew results. In addition, incorrect values can expose your organization to business impacts such as missed revenue opportunities, increased costs and heightened risk.
Currency, which focuses on how up to date the data sets being analyzed are. It is inadvisable to make critical business decisions based on stale data, so ensuring that your analytical data is current is vital.
Consistency, which considers the degree to which the information in different data sets can be corroborated, as well as value agreement within and across sets of records. For example, in a record representing the terms of a contract, a begin date that falls after the contract end date would be a glaring inconsistency. Inconsistent data sets pose integration and merging problems, leading to duplicated and potentially inaccurate information.
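Two of these dimensions, completeness and consistency, lend themselves to simple automated checks. The sketch below illustrates the idea using the contract example above; the record layout and field names are hypothetical, not drawn from any particular system.

```python
from datetime import date

# Hypothetical contract records; field names are illustrative only.
records = [
    {"id": 1, "begin_date": date(2023, 1, 1), "end_date": date(2023, 12, 31), "value": 10000},
    {"id": 2, "begin_date": date(2023, 6, 1), "end_date": date(2023, 3, 1), "value": 5000},   # begin after end
    {"id": 3, "begin_date": date(2023, 2, 1), "end_date": date(2024, 1, 31), "value": None},  # missing value
]

MANDATORY_FIELDS = ["id", "begin_date", "end_date", "value"]

def completeness(recs):
    """Share of records with every mandatory field populated."""
    complete = sum(
        all(r.get(f) is not None for f in MANDATORY_FIELDS) for r in recs
    )
    return complete / len(recs)

def inconsistent_dates(recs):
    """IDs of records whose begin date falls after the end date."""
    return [
        r["id"]
        for r in recs
        if r.get("begin_date") and r.get("end_date") and r["begin_date"] > r["end_date"]
    ]

print(completeness(records))        # 2 of 3 records are complete
print(inconsistent_dates(records))  # record 2 has begin > end
```

Real-world checks are rarely this tidy, but even lightweight scores like these give data stewards something measurable to track over time.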
Errors: No harm, no foul
How can it be determined if source data is suitable for its many potential uses? The answer is simplified by correlating data errors and issues to the potential downstream business impact. The quality of a data set typically is acceptable as long as any errors do not affect business outcomes. As a result, organizations should use a collaborative approach to define measures, methods of scoring and levels of acceptability for all analytical usage scenarios.
This view of data quality can be illustrated by an example: an online mega-retailer might analyze millions of daily transactions to look for emerging patterns that create opportunities for product bundling and cross-selling. Because of the volume of records and the expected outcome, a small number of data errors can be tolerated. However, the retailer might not tolerate any data flaws when using the same data sets in responding to specific customer support questions.
In other words, data quality requirements are directly related to the way data is used in individual business applications, including BI and analytics. Establishing the necessary level of trust in analytical data involves engaging business users and understanding what they will be doing with the data. With specific knowledge of how data errors can affect decision-making, controls can then be incorporated into data lifecycle management processes. Those controls can ensure compliance with data quality rules, monitor quality against the agreed-upon levels of acceptability and alert data stewards about quality issues, potentially triggering automated cleansing of data to meet downstream needs.
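Such a control can be sketched very simply: measure each dimension, then compare the scores against the acceptability levels agreed on for a given use. The thresholds and dimension names below are illustrative assumptions, not a specific product's API; note how the same scores pass the bulk-analytics bar but fail the stricter customer-support one, echoing the retailer example above.

```python
def check_quality(scores, thresholds):
    """Compare measured dimension scores against agreed acceptability levels.

    Returns (dimension, score, threshold) tuples for any dimension that
    falls below its acceptable level, so a data steward can be alerted
    (or automated cleansing triggered) downstream.
    """
    return [
        (dim, scores.get(dim, 0.0), thresholds[dim])
        for dim in thresholds
        if scores.get(dim, 0.0) < thresholds[dim]
    ]

# Measured scores for one data set (illustrative values)...
scores = {"completeness": 0.92, "accuracy": 0.99, "currency": 0.85, "consistency": 0.97}

# ...and acceptability levels for two different uses of the same data.
analytics_thresholds = {"completeness": 0.90, "accuracy": 0.95, "currency": 0.80, "consistency": 0.95}
support_thresholds   = {"completeness": 0.99, "accuracy": 0.999, "currency": 0.95, "consistency": 0.99}

print(check_quality(scores, analytics_thresholds))  # [] -> acceptable for bulk analytics
print(check_quality(scores, support_thresholds))    # every dimension fails the stricter bar
```

The key design point is that acceptability lives with the use case, not with the data set: one set of measurements, many verdicts.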
When done properly, the payoffs are enormous: Implementing effective data quality management and control procedures as part of your BI program will help lead to the data consistency, predictability and trustworthiness that are so critical to successful business analytics initiatives.
About the author:
David Loshin is president of Knowledge Integrity Inc., a consulting, training and development company that focuses on information management, data quality and business intelligence. Loshin also is the author of four books, including The Practitioner’s Guide to Data Quality Improvement and Master Data Management. He can be reached at firstname.lastname@example.org.