This article originally appeared on the BeyeNETWORK
Very early in the development of IT, we discovered that wise old dictum of garbage in, garbage out. GIGO, the catchy acronym for this dictum, reinforced the point that in this discipline, as well as most others, the ingredients and raw materials are essential if you want a good end product. Just as you cannot make a good apple pie with rotten apples, you cannot produce useful reports with bad data.
GIGO is particularly important in the context of business intelligence (BI) because, by definition, business intelligence is supposed to provide us with insights into our business needs from the bits and bytes we have collected.
Let us remember that there are two primary data domains: the operational and the analytic. Operational data is about transactions. It is the data that allows us to operate our business enterprises and is essential to run them. Analytical data, on the other hand, is a prerequisite to decision making and improvement. There can be no purposeful improvement of a situation unless there has been some analysis done. This analysis is normally carried out on transactional data that has been collected in the operational domain and then in some way manipulated, integrated and transformed to facilitate the analysis. This is at the heart of the business intelligence process, and we have all seen what happens when information that has never been collected is needed for decision support.
Well, the same, or worse, is true when we have collected transactional information, but its quality is woefully lacking. We might as well not have it since we often run the risk of producing results – generating insights? – that are truly misleading. The company that estimates future sales based on faulty data about its past orders is probably going to develop incorrect manufacturing plans.
The public sector is not immune from data quality problems. In many cases, it faces even more difficult situations than the private sector does. At least the latter has a built-in correction mechanism called the marketplace that will eventually force that enterprise to fix the data or bear the brunt of a competitive disadvantage that will result in loss of sales and revenue.
The government faces problems that cover the gamut of data quality attributes: accuracy, completeness, consistency, timeliness, uniqueness and validity. In spite of its power to legislate, regulate and enforce, data that has been collected by the government can also be inaccurate, incomplete, inconsistent or out of date. It is not necessarily worse than in the private sector, but it is certainly not better in many cases.
Let us first look at problems of accuracy. Accuracy refers to the percentage of cases where the data matches reality. In other words, if a state’s Department of Motor Vehicles files indicate that 47.8% of the recipients of driver’s licenses were female, then the accuracy of those files is measured by the extent to which that figure reflects reality. When data is collected from self-service paper or electronic forms, the individual may have an incentive to be truthful or careful in filling them out…but not necessarily. For example, how accurate is the weight listed on your driver’s license?
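To make the definition concrete, here is a minimal sketch of accuracy as a match rate. It assumes that a verified reference value exists for each record (for example, from an audit sample), which is often the hard part in practice. The function name `accuracy_rate` and the sample values are hypothetical and used only for illustration.

```python
def accuracy_rate(recorded, verified):
    """Return the share of records whose stored value matches the verified value."""
    if len(recorded) != len(verified):
        raise ValueError("recorded and verified must be the same length")
    if not recorded:
        raise ValueError("no records to compare")
    matches = sum(1 for r, v in zip(recorded, verified) if r == v)
    return matches / len(recorded)


# Hypothetical audit sample: sex as stored on file vs. as confirmed by review.
on_file = ["F", "M", "F", "F", "M"]
audited = ["F", "M", "M", "F", "M"]

print(f"{accuracy_rate(on_file, audited):.0%}")  # 4 of 5 records match: 80%
```

Note that an aggregate such as "47.8% female" can look plausible even when individual records are wrong, because errors in opposite directions cancel out. A record-by-record comparison like this one avoids that trap.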
It is not uncommon for data to be incomplete. This means that certain fields may be unfilled. Data collection by government organizations has traditionally been done using a “form-centric” approach. One consequence is that a form originally designed for one purpose is frequently reused for another in order to avoid spending money on the development of a new form. (For example, “Effective immediately, Standard Form X will now also be used to apply for Y and Z type licenses as well as for all of its previous uses.”) The undesirable result is that many sections of the form are often not applicable and, hence, are left blank.
Incompleteness relates to the most vexing issue of nulls or blanks. If a field is left blank, is this an acceptable value? How do you interpret a person’s name when that line on the form is blank? While it is unlikely that a null can actually be a person’s name, a blank can be a valid entry. For example, there are a great many records where “Address Line 2” is blank, and there certainly are enough people with no middle names; but how do you differentiate between a valid blank entry (e.g., the person has no middle name) and an incomplete or incorrect entry? Now think of this in the context of a Homeland Security screening program at your airport where you are not only trying to make sure you keep bad guys off the planes, but also trying to minimize or avoid false positives that incorrectly target an innocent traveler.
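One partial remedy is to state explicitly which fields may legitimately be blank. The sketch below assumes a hypothetical rule set (`REQUIRED_FIELDS`, `OPTIONAL_FIELDS`) and hypothetical field names; it is not drawn from any actual government form.

```python
# Hypothetical field rules, for illustration only.
REQUIRED_FIELDS = ("first_name", "last_name")
OPTIONAL_FIELDS = ("middle_name", "address_line_2")


def is_blank(value):
    """Treat None, empty strings and whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""


def missing_required(record):
    """Blank required fields: these make the record incomplete."""
    return [f for f in REQUIRED_FIELDS if is_blank(record.get(f))]


def blank_optional(record):
    """Blank optional fields: acceptable, but still ambiguous."""
    return [f for f in OPTIONAL_FIELDS if is_blank(record.get(f))]


traveler = {"first_name": "Ana", "middle_name": "",
            "last_name": "  ", "address_line_2": None}

print(missing_required(traveler))  # ['last_name']
print(blank_optional(traveler))    # ['middle_name', 'address_line_2']
```

A check like this flags the blank last name, but it cannot tell whether the blank middle name means "has none" or "not provided." Resolving that requires the form itself to capture the distinction, for instance with an explicit "no middle name" indicator.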
It becomes even more complicated when you are dealing with quantitative information. Here a blank is often taken to mean zero. However, because this may not be universally so, it can certainly be very misleading. Think, for example, of a record in an Air Force database where the field for the altitude of a plane is left blank. Does this mean zero or that the plane was on the ground? Or does it mean that the respondent did not know the altitude? Or was that field irrelevant to the purpose and intentionally left blank? All of these occur in the context of government data.
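The altitude example can be sketched in code. The idea is to keep "zero," "unknown" and "not applicable" as three distinct states instead of collapsing them into one number. The type names, and the convention that an empty string means unknown while "N/A" means not applicable, are assumptions made for this illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Missing(Enum):
    UNKNOWN = "unknown"
    NOT_APPLICABLE = "not applicable"


@dataclass
class Altitude:
    feet: Optional[float] = None       # None when no value was recorded
    missing: Optional[Missing] = None  # why the value is absent, if it is


def parse_altitude(raw: str) -> Altitude:
    """Parse a raw form entry without ever turning a blank into zero."""
    text = raw.strip()
    if text == "":
        return Altitude(missing=Missing.UNKNOWN)
    if text.upper() == "N/A":
        return Altitude(missing=Missing.NOT_APPLICABLE)
    return Altitude(feet=float(text))


def mean_altitude(values):
    """Average only the altitudes that were actually recorded."""
    known = [a.feet for a in values if a.feet is not None]
    return sum(known) / len(known) if known else None


readings = [parse_altitude(s) for s in ["0", "", "35000", "N/A"]]
print(mean_altitude(readings))  # 17500.0
```

Had the two absent values been read as zero, the average would have dropped to 8750.0, a figure that describes no real set of aircraft.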
We are all familiar with issues of inconsistency. Often this is the result of a lack of, or a multiplicity of, standards or code structures. How many ways can you represent a specific state? You would be surprised. Many of us still use the old FLA, PENN or MASS abbreviations carried over from legacy applications. Why use MASS? Maybe because it is more intuitive than remembering the differences between MA, MD, ME, MI, MN, MO, MS and MT. Problems arise more commonly with the code structures or “short names” for fields such as cities, airports, makes and models of airplanes, military ranks, agricultural products, types of housing property, medical specialties, medical conditions or university abbreviations. These are all essential components of transactional data if you work at the FAA or the National Science Foundation or the Departments of Labor, Agriculture, Housing, Defense, or Health and Human Services.
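A common way to tame this kind of inconsistency is to map every known variant onto one standard code and to flag anything unrecognized for review, not guess at it. The sketch below uses a deliberately tiny, hypothetical alias table (`STATE_ALIASES`) covering only the three states named above; a real table would be far larger.

```python
# Illustrative and deliberately incomplete alias table.
STATE_ALIASES = {
    "FL": "FL", "FLA": "FL", "FLORIDA": "FL",
    "PA": "PA", "PENN": "PA", "PENNSYLVANIA": "PA",
    "MA": "MA", "MASS": "MA", "MASSACHUSETTS": "MA",
}


def normalize_state(raw):
    """Return the two-letter code for a known variant, or None if unrecognized."""
    key = raw.strip().rstrip(".").upper()
    return STATE_ALIASES.get(key)


for value in ["Mass.", " fla ", "PENN", "MX"]:
    print(repr(value), "->", normalize_state(value))
```

Returning `None` for an unknown value such as "MX" keeps the problem visible. Silently mapping it to the nearest-looking code would trade an inconsistency for an inaccuracy.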
There is also the issue of timeliness, or simply getting data that is out of date. This can be more or less important depending on the context; and while it is very often frustrating to the user, it may be unfair to ding the government for this. In many cases, the government has the only data available, period. It may be the result of a one-time survey or study. (The National Institutes of Health and the Bureau of Labor Statistics have all sorts of interesting data.) Or, it may be the result of a legal requirement that calls for this type of data to be collected only every few years. (The Census Bureau is constitutionally mandated to conduct a census every ten years.) When you access this kind of information, it usually is with mixed emotions: glee (that you have something) and frustration (that it may be very old).
When you need data to meet a deadline of your own, you will often have no choice but to use the old data: rather than miss the deadline, you make estimates from what is available. All of these things happen, and happen often, in many parts of the Federal government enterprise.
Lastly, there is the issue of data availability. There are times when we need data that has never been collected, has been collected in an incompatible format, or has not been collected electronically. With respect to the latter two, these may ultimately be just other examples of timeliness because it is possible that conversions may allow you to take data from a spreadsheet (or even paper) and make it available to your analytical engines. However, what if the data has never been collected?
A case in point is when an agency wants to report on its grants or specific activities by ethnic group but discovers that no information on ethnicity was ever requested in the application process. This presents a rather common conundrum, especially as the Congress often requests information ex post facto from the Executive Branch, information that this branch never expected it would be required to provide.
There are many reasons why data quality issues can generate problems with business intelligence in the public sector, and we’ve just touched the tip of the iceberg with the examples provided in this article. There are many data integrity tools that can assist tremendously in addressing these problems and substantially alleviating them. It is important to remember that to obtain solid business intelligence from our bits and bytes, we must avoid garbage in.
Dr. Barquin has been the President of Barquin International, a consulting firm, since 1994. He specializes in developing information systems strategies, particularly data warehousing, customer relationship management, business intelligence and knowledge management, for public and private sector enterprises. He has consulted for the U.S. Military, many government agencies and international governments and corporations.
Dr. Barquin is a member of the E-Gov (Electronic Government) Advisory Board, and chair of its knowledge management conference series; member of the Digital Government Institute Advisory Board; and has been the Program Chair for E-Government and Knowledge Management programs at the Brookings Institution. He was also the co-founder and first president of The Data Warehousing Institute, and president of the Computer Ethics Institute. His PhD is from MIT. Dr. Barquin can be reached at email@example.com.