This article originally appeared on the BeyeNETWORK.
Part 1 of this series by Shaku Atre covers top-down logical data modeling.
Bottom-Up Source Data Analysis
Data analysis cannot stop after top-down logical data modeling because the source data often does not follow the business rules and policies captured during the data modeling sessions. If bottom-up source data analysis is not performed, the data problems and business rule violations would not be discovered until the extract, transform and load (ETL) process. Some data quality problems would not be discovered at all until after implementation, and then only if somebody complained about them. As Figure 1 shows, source data mapping must adhere not only to the usual technical data conversion rules but more so to the business data domain rules and to the business data integrity rules.
Figure 1: Source Data Mapping Rules
Technical Data Conversion Rules
The following technical rules should be observed for any type of data conversion:
- The data types of the source data element must match the data types of the target data element.
- The data length of the source data element must be adequate for the data to be moved, expanded, or truncated into the target data element.
- The programs manipulating the data elements must be compatible with the contents of the data elements.
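As a rough sketch, these technical checks could be automated against source-to-target mapping metadata. The column-metadata format and field names below are illustrative assumptions, not part of the original text:

```python
# Sketch of automated technical data conversion checks.
# The metadata format (name/type/length dicts) is an assumed, illustrative layout.

def check_conversion(source_col, target_col):
    """Return a list of technical rule violations for one source-to-target mapping."""
    problems = []
    # Rule 1: the data types of source and target must match.
    if source_col["type"] != target_col["type"]:
        problems.append(f"type mismatch: {source_col['type']} -> {target_col['type']}")
    # Rule 2: the target length must be adequate; otherwise flag possible truncation.
    if source_col["length"] > target_col["length"]:
        problems.append(f"possible truncation: length {source_col['length']} -> {target_col['length']}")
    return problems

issues = check_conversion(
    {"name": "CUST_NAME", "type": "CHAR", "length": 40},
    {"name": "customer_name", "type": "VARCHAR", "length": 30},
)
```

In practice such checks run over every mapped column pair before ETL design begins, so that mismatches surface during analysis rather than during load.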
Business Data Domain Rules
Business data domain rules are rules about the semantics (meaning and interpretation) of data contents. As a result, business data domain rules require much more effort in preparing the data for conversion. The business community considers these rules to be more important than the technical data conversion rules. A source data element can meet all technical data conversion rules but its contents can still be wrong. The business data domain rules are used to identify and correct data violations similar to those listed in Figure 2.
Data Domain Violations
1. Missing data values (a big issue on BI projects)
2. Default values; for example: 0, 999, FF, blank
3. Intelligent "dummy" values, which are specific default (or dummy) values that actually have a meaning; for example: a Social Security Number of 888-88-8888 used to indicate that the person is a non-resident alien
4. Logic embedded in a data value; for example: using lower-valued ZIP codes (postal codes) to indicate a state on the east coast, such as 07456 in New Jersey, and higher-valued ZIP codes to indicate a state on the west coast, such as 91024 in California
5. Cryptic and overloaded data content; for example: the values "A, B, C, D" of a data element define the type of customer, while the values "E, F, G, H" of the same data element define the type of promotion, and the values "I, J, K, L" define the type of location
6. Multipurpose data elements – that is, programmatically and purposely redefined data content; the most obvious example is the REDEFINES clause in COBOL
7. Multiple data elements embedded in, concatenated across, or wrapped around free-form text fields; for example: Address Lines 1 through 5 containing name and address data elements:
   Address Line 1: Brokovicz, Meyers, and Co
Figure 2: Data Domain Violations
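Several of the violations in Figure 2 can be detected mechanically once the business has catalogued its known default and dummy values. The following is a minimal sketch; the dummy-value lists and column names are illustrative assumptions:

```python
# Sketch of domain-rule checks for violations like those in Figure 2.
# KNOWN_DUMMY_SSNS and DEFAULT_VALUES are assumed, business-supplied catalogs.

KNOWN_DUMMY_SSNS = {"888-88-8888", "999-99-9999"}  # "intelligent" dummy values (violation 3)
DEFAULT_VALUES = {"", "0", "999", "FF"}            # default placeholders (violation 2)

def domain_violations(record):
    """Return a list of domain-rule violations found in one source record."""
    violations = []
    ssn = record.get("ssn")
    if ssn is None:
        violations.append("missing value: ssn")           # violation 1
    elif ssn in KNOWN_DUMMY_SSNS:
        violations.append(f"dummy value: ssn={ssn}")      # violation 3
    if record.get("customer_type") in DEFAULT_VALUES:
        violations.append("default value: customer_type") # violation 2
    return violations

v = domain_violations({"ssn": "888-88-8888", "customer_type": "0"})
```

Note that the catalogs themselves must come from the business representatives; no tool can know on its own that 888-88-8888 means "non-resident alien."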
Business Data Integrity Rules
Similar to business data domain rules, business data integrity rules are much more important to improving information quality than are the technical data conversion rules. Some examples of violations to business data integrity rules are listed in Figure 3.
Data Integrity Violations
1. Contradicting data content between two or more data elements; for example: Boston, CA (instead of MA)
2. Business rule violation; for example: Date of Birth = 05/02/1985 and Date of Death for the same person = 11/09/1971
3. Reused primary key (same key value used for multiple object instances); for example: two employees with the same employee number
4. No unique primary key (multiple key values for the same object instance); for example: one customer with multiple customer numbers
5. Objects without their dependent parent object; for example: job assignment points to employee 3321, but there is no employee 3321 in the employee database
6. A real-world relationship between two data objects that cannot be built in the database due to a gap in business knowledge among operational systems
Figure 3: Data Integrity Violations
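Integrity checks like those in Figure 3 can also be scripted during data archaeology. A minimal sketch, assuming simple in-memory record lists (table layouts and column names are illustrative):

```python
from collections import Counter
from datetime import date

# Sketch of integrity-rule checks for violations like those in Figure 3.
# The employee/assignment record layouts are assumed, illustrative structures.

def integrity_violations(employees, assignments):
    violations = []
    # Violation 3: reused primary key (same key value for multiple instances).
    counts = Counter(e["emp_no"] for e in employees)
    for emp_no, n in counts.items():
        if n > 1:
            violations.append(f"duplicate primary key: employee {emp_no}")
    # Violation 2: business rule violation - date of death precedes date of birth.
    for e in employees:
        if e.get("died") and e["died"] < e["born"]:
            violations.append(f"death before birth: employee {e['emp_no']}")
    # Violation 5: child object without its dependent parent object.
    known = {e["emp_no"] for e in employees}
    for a in assignments:
        if a["emp_no"] not in known:
            violations.append(f"orphan assignment: employee {a['emp_no']} not found")
    return violations

emps = [
    {"emp_no": 1001, "born": date(1985, 5, 2), "died": date(1971, 11, 9)},
    {"emp_no": 1002, "born": date(1990, 1, 1), "died": None},
    {"emp_no": 1002, "born": date(1970, 3, 3), "died": None},
]
jobs = [{"emp_no": 3321, "job": "analyst"}]
found = integrity_violations(emps, jobs)
```

Real checks would of course run as queries against the source databases rather than over in-memory lists, but the rule logic is the same.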
Every critical and important data element must be examined for these defects, and a decision must be made whether and how to correct them. The information consumers (businesspeople who will be using those data elements to make business decisions) and data owners should make that decision after discussing the impact of the cleansing effort with the business sponsor, the project manager and the core team.
One of the goals stated most frequently for business intelligence (BI) applications is to deliver clean, integrated and reconciled data to the business community. Unless all three sets of data mapping rules, as discussed previously, are addressed, this goal cannot be achieved. Many organizations will find a much higher percentage of dirty data in their source systems than they expected, and their challenge will be to decide how much of it to cleanse.
Data Quality Responsibility
Data archaeology (the process of finding bad data), data cleansing (the process of correcting bad data), and data quality enforcement (the process of preventing data defects from happening now and in the future at the source) are all business responsibilities – not IT responsibilities. That means that business representatives – information consumers as well as data owners – must be involved with the data analysis activities and familiar with the source-to-target data mapping rules.
Because data owners originate the data and establish business rules and policies over the data, they are directly responsible to the downstream information consumers (knowledge workers, business analysts, business management) who need to use that data. If downstream information consumers base their business decisions on poor quality data and suffer financial losses because of it, the data owners must be held accountable. In the past, this accountability has been absent from stovepipe systems. Data quality accountability is neither temporary nor BI-specific, and the business community must make the commitment to accept these responsibilities permanently.
The challenge for IT and for the business sponsor on a BI project is to enforce the inescapable tasks of data archaeology, data cleansing and data quality enforcement to meet the quality goals of the BI applications.
Note: The data analysis step may be time intensive since many “battles” may rage among the business representatives as to the valid meaning and domain of data.
Although data cleansing tools can assist in the data archaeology process, developing data cleansing specifications is mainly a manual process. IT managers, business managers, and data owners who have never been through a data quality assessment or a data cleansing initiative are unfamiliar with the analysis process. They often underestimate the time and effort required of their staff by a factor of four or more.
Source Data Selection Process
Poor quality data is such an overwhelming problem that most organizations will not be able to correct all of the discrepancies. When selecting the data for the BI application, consider the general steps shown in Figure 4.
Figure 4: Source Data Selection Process
Step 1: Identify data of interest and the significance of this data. Data cleansing is a collaborative effort between business representatives who are familiar with the semantics of the data, and data quality analysts who know the program-specific meanings of the data (e.g., use and meaning of a “flag” value, redefined record layouts).
Step 2: Analyze the data for content, meaning, and importance. Many organizations have accumulated massive amounts of data in files and databases. This data constitutes a prospective gold mine of valuable business knowledge and is potentially a good source for data mining. However, the quality of the data content must be assessed first, since mining dirty data is of little value.
Step 3: Determine which data to include in the BI application. Select only the data that will meet core business requirements. Even with automated tools, the cost of assuring data quality for an all-inclusive BI decision support environment becomes prohibitive for most organizations. Some questions to consider when selecting data are:
- Is this data clean enough for decision support usage?
- If not, can this data be [partially] cleansed? Do we know how?
- Is the dirty data the reason for building this BI application? Is cleansing this data therefore mandatory?
- How much effort will it take to figure out how to cleanse it?
- How much will the data cleansing effort cost?
- What is the benefit of cleansing the data as opposed to moving it over dirty?
- What are the data quality expectations from the business community and from management?
Step 4: Prepare the data cleansing specifications. The IT staff, working with business representatives, will get to know the necessary business rules, which are needed to write the data cleansing specifications. In essence, this is a source data reengineering process.
Step 5: Select ETL and cleansing tools. Determine whether it is appropriate and cost-effective to acquire an ETL tool, a cleansing tool, or both. Examine the suitability and effectiveness of those tools. Some data cleansing specifications can be very complicated. Be sure the tools are capable of executing them.
Note: Automated tools do not eliminate the manual labor of source data analysis. Human beings can draw inferences during data analysis, a capability that automated tools lack.
Data Selection Key Points
When identifying and selecting the operational data to be used to populate the BI target databases, some key points should be considered. Applying the source data selection criteria listed in Figure 5 will minimize the need and effort for data cleansing.
Figure 5: Source Data Selection Criteria
Data integrity – How internally consistent is the data? This is the most important criterion.
- The greater the proportion of manually entered data (data keyed in with few or no data controls, edits and validations), the lower the integrity.
- Programming errors also contaminate great masses of data – and do so automatically.
- The lower the integrity, the greater the cleansing requirement.
Data precision – This is the next most important criterion.
- How is the data represented internally?
- For numeric data, what are the scale and precision of the data?
- For date data, how is it formatted?
Data accuracy – How correct is the data?
- Are there edit checks in the data entry program?
- Are dependent values crosschecked? For example, does the data entry program prevent an expiration date from preceding an effective date?
- Is there an operational process in place for correcting data?
- Are calculated values stored? What mechanisms, if any, are in place to keep these values accurate?
Data reliability – How old is the data?
- What generation is the data (month-end, weekly, daily)?
- Was the data obtained from direct sources or from downloads?
- Is the source of the data known?
- Is the data a duplicate of another data store; if so, is it current?
Data format – The closer the data is to the destination data format, the lower the conversion requirements will be. From highest to lowest, the format priorities are:
- Data from a relational database (DB2, Oracle)
- Data from a non-relational database (IMS, CA-IDMS)
- Flat files (VSAM, ISAM), which are the least desirable
Note: Source data quality will only be as good as the enforcement of quality processes in the operational systems. Mandatory quality processes should include data entry rules and edit checks in programs. If those processes are not enforced, or don’t exist, data usually gets corrupted, regardless of whether the data is in a relational database or in an old VSAM file.
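The dependent-value crosscheck mentioned under data accuracy is the kind of edit check a data entry program should enforce at the source. A minimal sketch, with illustrative field names:

```python
from datetime import date

# Sketch of a data-entry edit check: an expiration date must not
# precede its effective date. Field names are illustrative assumptions.

def validate_entry(effective, expiration):
    """Cross-check dependent date values before accepting the record."""
    if expiration < effective:
        return "rejected: expiration precedes effective date"
    return "accepted"

result = validate_entry(date(2024, 6, 1), date(2024, 1, 1))
```

The same rule reappears later as a cleansing rule in the BI project when the operational system never enforced it, which is why enforcing it at the source is cheaper.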
To Cleanse or Not To Cleanse...
Many organizations struggle with this question. Data cleansing research indicates that some organizations downplay data cleansing to achieve short-term goals. The consequences of not addressing poor quality data usually hit home when their business ventures fail or are adversely affected because of inaccurate data.
It is important to recognize that data cleansing is a labor-intensive, time-consuming and expensive process. Cleansing all of the data is usually not cost-justified – nor practical; but cleansing none of it is equally unacceptable. It is therefore important to analyze the source data carefully and to classify the data elements as being critical, important or insignificant to the business. Concentrate on cleansing all of the critical data elements, keeping in mind that not all data is equally critical to all businesspeople. Then, cleanse as many of the important data elements as time allows, and move the insignificant data elements as-is. In other words, you do not need to cleanse all of the data, and not all at once.
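The triage just described (cleanse all critical elements, cleanse important ones as time allows, move insignificant ones as-is) can be expressed as a simple planning routine. The element classifications and cost figures below are illustrative assumptions:

```python
# Sketch of the cleanse-or-move triage described above.
# Classifications ("critical"/"important"/"insignificant") and cost figures
# are assumed, illustrative inputs supplied by the business representatives.

def plan_cleansing(elements, budget_days):
    plan = {"cleanse": [], "move_as_is": []}
    remaining = budget_days
    # Critical elements are always cleansed, regardless of budget.
    for e in (x for x in elements if x["class"] == "critical"):
        plan["cleanse"].append(e["name"])
        remaining -= e["cost_days"]
    # Important elements are cleansed only while time allows.
    for e in (x for x in elements if x["class"] == "important"):
        if e["cost_days"] <= remaining:
            plan["cleanse"].append(e["name"])
            remaining -= e["cost_days"]
        else:
            plan["move_as_is"].append(e["name"])
    # Insignificant elements are moved as-is.
    plan["move_as_is"] += [x["name"] for x in elements if x["class"] == "insignificant"]
    return plan

plan = plan_cleansing(
    [
        {"name": "customer_id", "class": "critical", "cost_days": 5},
        {"name": "zip_code", "class": "important", "cost_days": 4},
        {"name": "fax_number", "class": "insignificant", "cost_days": 2},
    ],
    budget_days=8,
)
```

The real decision is, of course, a negotiation among information consumers, data owners and the sponsor; the sketch only makes the priority order explicit.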
Cleansing Operational Systems
When the selected data is cleansed, standardized and moved into the BI target databases, a question to consider is whether the source files and source databases should also be cleansed. Management may ask, “Why not spend a little extra money and time to cleanse the source files and databases so that the data is consistent in the source as well as in the target?” This is a valid question, and this option should definitely be pursued if the corrective action on the source system is as simple as adding an edit check to the data entry program.
If the corrective action requires changing the file structure, which means modifying (if not rewriting) most of the programs that access that file, the cost for such an invasive corrective action on the operational system is probably not justifiable – especially if the bad data is not interfering with the operational needs of that system. Let’s not forget that companies didn't even want to make such drastic changes for the now infamous Y2K problem – they only made those changes when it was clear that their survival was at stake. Certainly, a misused code field does not put an organization’s survival at stake. Hence, the chances that operational systems will be fixed are bleak.
Data Analysis Activities
Data analysis activities do not need to be performed linearly. Figure 6 indicates which activities can be performed concurrently.
Figure 6: Data Analysis Activities
- Analyze external data sources: In addition to requiring internal operational source data, many BI applications need data from external sources. Merging external data with internal data presents its own set of challenges. External data is often dirty and incomplete, and it usually does not follow the same format or key structure as internal data. These differences must be identified and resolved during this step.
- Refine logical data model: A high-level project-specific logical data model should have been created during one of the previous steps. In addition, some or all of the internal and external data may have been modeled on other projects and may already be part of the enterprise logical data model. In that case, the representative portion of the enterprise logical data model can be extracted and expanded with the new data objects, new data relationships and new data elements. If the required data has not been previously modeled, a new logical data model for the scope of this BI project must be created. It should include all internal as well as external data elements.
- Analyze source data quality: At the same time as the logical data model is created or expanded, the quality of the internal and external source files and source databases must be analyzed in detail. It is quite common that existing operational data does not conform to the stated business rules and business policies. Many data elements are used for multiple purposes or are simply left blank. All these discrepancies must be identified and incorporated into the logical data model.
- Expand enterprise logical data model: Once the project-specific logical data model is relatively stable, it needs to be merged back into the enterprise logical data model. During this merge process, additional data discrepancies or inconsistencies may be identified. Those will be sent back to the BI project for resolution.
- Resolve data discrepancies: Occasionally, data discrepancies discovered during the data analysis step involve other business representatives from other projects. In that case, the other business representatives as well as the data owners will need to be summoned to work out their differences. They will either discover a new legitimate subtype of a data object or a new data element, which must be modeled as such, or they will have to resolve and standardize the inconsistencies.
- Write data cleansing specifications: Once all data problems are identified and modeled, the specifications for how to cleanse the data must be written. These specifications should be in plain English so that they can be validated by the data owner and by businesspeople who will use the data.
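One way the plain-English specifications might later be mechanized in the ETL process is to keep the English rule text alongside its executable transform, so the data owner can validate the words while IT validates the code. The rules and column names below are illustrative assumptions:

```python
# Sketch of pairing plain-English cleansing rules with their transforms.
# The rule texts, fields and transforms are assumed, illustrative examples.

CLEANSING_SPECS = [
    # (plain-English rule, field, transform)
    ("Replace dummy SSN 888-88-8888 with NULL",
     "ssn", lambda v: None if v == "888-88-8888" else v),
    ("Standardize state codes to uppercase two-letter abbreviations",
     "state", lambda v: v.strip().upper()[:2] if v else v),
]

def apply_specs(record):
    """Apply every cleansing rule whose field appears in the record."""
    cleansed = dict(record)
    for _rule_text, field, transform in CLEANSING_SPECS:
        if field in cleansed:
            cleansed[field] = transform(cleansed[field])
    return cleansed

out = apply_specs({"ssn": "888-88-8888", "state": " ma "})
```

Keeping the English text and the transform together also feeds directly into the source-to-target mapping document mentioned later under deliverables.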
Deliverables Resulting from these Activities
Normalized and fully attributed logical data model: This project-specific logical data model is a fully normalized entity-relationship diagram showing kernel entities, associative entities, characteristic entities, cardinality, optionality, unique identifiers and all attributes.
Business metadata: The business entities and attributes from the logical data model must be described with metadata. Data-specific business metadata components include data names, data definitions, data relationships, unique identifiers, data types, data lengths, domains, business rules, policies and data ownership. These are usually captured in the repository of the CASE tool.
Data cleansing specifications: This document describes the cleansing logic that must be applied to the source data in order to bring it into compliance with the technical data conversion rules, the business data domain rules and the business data integrity rules. This document will be used to create the transformation specifications on the source to target mapping document.
Expanded enterprise logical data model: This deliverable is produced behind the scenes by data administration or the EA group when they merge the project-specific logical data model into the enterprise logical data model. Any rejected entities or attributes and any discrepancies between the models will be presented to the BI project team for resolution.
(Source: Moss, Larissa and Atre, Shaku. Business Intelligence Roadmap – The Complete Project Lifecycle for Decision-Support Applications, Addison Wesley Professional, 2003.)
Author’s Note: The Business Intelligence Roadmap includes a set of all major activities and tasks that are appropriate for BI projects. Not every BI project will have to perform every single activity in every step. To receive a complimentary copy of the Business Intelligence Navigator, designed to help chart the business intelligence journey, please visit http://www.atre.com/bi_navigator/navigator.html.