Data Archeology

Data archeology has always been a difficult and time-consuming task, but the good news is that there is a new generation of semantic discovery solutions that can radically change how the job is done.

This article originally appeared on the BeyeNETWORK.

The Importance of the Meaning of Data
The number one problem today in turning data into something useful to the enterprise is the lack of understanding of that data, both what it means on a fundamental level and the context behind it. Let me cite a few examples we are all familiar with:

  • Mars Climate Orbiter problem: Data was assumed to be in one unit of measure but was actually recorded in another (English vs. metric units); hence, millions of dollars of equipment was lost. 

  • Bombing of the Chinese Embassy in Belgrade in 1999: The map used for targeting was out of date. 

Public corporations and government agencies face the yearly challenge of producing an annual report filled with counts and statistics. The assumptions behind these counts can easily be misunderstood or misconstrued. Consider the simple problem of counting how many customers you have. What is the definition of a customer? Different definitions yield different counts. Does a customer of a software company who is on maintenance for a previously purchased license count as a customer, or do you count only new customers? What about repeat customers? Or do you define new terms that specify these various classifications?
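
To make the ambiguity concrete, here is a minimal sketch in Python (the customer records and the "type" attribute are invented for illustration) showing how three perfectly reasonable definitions of "customer" produce three different counts:

    # Hypothetical customer records; fields are invented for illustration.
    customers = [
        {"id": 1, "type": "new_license", "last_purchase_year": 2006},
        {"id": 2, "type": "maintenance", "last_purchase_year": 2003},
        {"id": 3, "type": "new_license", "last_purchase_year": 2006},
        {"id": 4, "type": "maintenance", "last_purchase_year": 2006},
    ]

    # Definition A: anyone who has ever bought anything.
    count_all = len(customers)

    # Definition B: only accounts with activity in the reporting year.
    count_active = sum(1 for c in customers if c["last_purchase_year"] == 2006)

    # Definition C: only new license sales, excluding maintenance renewals.
    count_new = sum(1 for c in customers if c["type"] == "new_license")

    print(count_all, count_active, count_new)   # 4 3 2

Same data, three defensible answers; the annual report is only as good as the definition behind the number.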

Extrapolate even further: Business intelligence (BI) tools help businesspeople make financial decisions by looking at statistics using multidimensional analysis. It is highly likely that at least one cube will have hidden assumptions about the meanings of business terms. Misunderstood data may unwittingly cause many errors in judgment because the business terms used are not clarified and their definitions are not made readily available.

Semantics: The Study of Meaning
The definitions, context, assumptions and rules surrounding business concepts are the semantics. In information systems throughout the years, we have done a poor job at capturing these semantics. (Remember how we all absolutely hated to do documentation? Now it’s coming back to bite us!)

Every business system has its own unique semantics. In addition to the system’s semantics, every user has his or her own semantics. For example, on one migration project, I discovered an entry of 25/25/2525 in a date field. No database management system on the planet will accept that value in a true date column, so the field was evidently stored as text. It turns out that one of the developers liked the song “In the Year 2525”, so he used this value as a code in the field to mean something like “date not available.” Thus, this guy essentially had his own semantics.

So, as a good consultant, when I realized this, I began to help my clients build systems that kept track of these semantics. I saw definitions and business rules as two very important items. I helped my clients build repositories that store them both. Usually these repositories were homegrown because the traditional repository tools didn’t store business rules or any type of business metadata very well. In addition to storing this information in homegrown repositories, I also found that these rules required a grammar that was both visual and semantically rich. I invented my own process for depicting these semantic nuggets or building blocks, and it was very complex. I was attempting to list all the various ways of representing each semantic concept, but because I didn’t have any tools that inherently understood semantics, what I delivered had to express the concept in several different representations in order to be verified. Our method had to be translated into parsing code (we used Perl), and we had to manually search using SQL to verify that we had found all occurrences of the semantic concept. It was extremely labor-intensive.

This article presents a case study of that project and how we struggled through the semantics discovery process. It also introduces some recently launched tools and techniques that can greatly streamline this process.

Fundamentals of Data Migration
Data migration is performed whenever data is moved from one place to another. Data migration is involved in the following situations:

  • Data/system conversions, including conversions to customer relationship management (CRM) software, financial management software or any other packaged software
  • Data warehouses
  • Integration efforts such as enterprise information integration (EII) projects
  • B2B transactions and supply chain integration
  • Data sharing and reuse between any application and any organization

Data migration is extremely common. Mapping is inherent in all data migration efforts. Data mapping is the fine art of illustrating which source fields get moved into which target fields. The source is always provided, in the form of your current system(s). Sometimes the target is provided for you, in the form of a commercial off-the-shelf (COTS) tool such as CRM software or financial software; sometimes you are designing the target, such as a data warehouse or a new application that is homegrown.
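
At its simplest, a mapping is just a statement of which source field feeds which target field. Here is a minimal sketch (the field names are invented) of what a straightforward field-level mapping looks like in executable form:

    # Hypothetical field-level mapping: source column -> target column.
    field_map = {
        "cust_nm":   "customer_name",
        "addr_ln_1": "street_address",
        "pst_cd":    "postal_code",
    }

    def apply_mapping(source_row, field_map):
        """Move each mapped source field into its target field."""
        return {target: source_row[source] for source, target in field_map.items()}

    row = {"cust_nm": "Acme Corp", "addr_ln_1": "12 Main St", "pst_cd": "80401"}
    print(apply_mapping(row, field_map))
    # {'customer_name': 'Acme Corp', 'street_address': '12 Main St',
    #  'postal_code': '80401'}

Of course, real mappings are rarely this tidy, as the rest of this article demonstrates.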

Mappings are critically important. If the mappings are not correct, the entire project is in jeopardy: the data will not be in the right places and the application cannot function properly. It is therefore important not only to get the mappings right, but also to make sure everyone else agrees that you have gotten them right.

Semantics and Mappings
Mappings are really all about semantics. Everywhere you look, you are dealing with nuggets of meaning, and you want to verify that the two nuggets you map together mean the same thing.

There is good news and bad news here. The good news is that there are usually fewer unique nuggets of meaning than there are unique patterns of how the same nugget can be expressed. For example, here are six different patterns that express the same unique land description:

  • Lots 4,5,6
  • L4-6
  • L4,5,6
  • L4 L5 L6
  • L4 5 6
  • GL 4-6

The bad news is that we usually approach this problem as a pattern-matching exercise. All our tools perform parsing, so even though all of these strings are really expressing one concept, we end up coding each pattern separately.
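
To show what "coding the patterns" actually looks like, here is a minimal sketch in Python (the regular expressions are mine, not the project's original Perl) that reduces the six surface forms above to the single concept they express. Notice that every new surface form requires another hand-written rule:

    import re

    # Each surface pattern for "government lots 4 through 6" gets its own rule.
    patterns = [
        (re.compile(r"^Lots\s+(\d+),(\d+),(\d+)$"), "list"),      # Lots 4,5,6
        (re.compile(r"^(?:L|GL)\s?(\d+)-(\d+)$"),   "range"),     # L4-6, GL 4-6
        (re.compile(r"^L(\d+),(\d+),(\d+)$"),       "list"),      # L4,5,6
        (re.compile(r"^L(\d+)\s+L?(\d+)\s+L?(\d+)$"), "list"),    # L4 L5 L6, L4 5 6
    ]

    def normalize(text):
        """Reduce any recognized surface form to a canonical set of lot numbers."""
        for regex, kind in patterns:
            m = regex.match(text.strip())
            if not m:
                continue
            nums = [int(n) for n in m.groups()]
            if kind == "range":
                return set(range(nums[0], nums[1] + 1))
            return set(nums)
        return None  # unrecognized pattern -- yet another case to code

    for s in ["Lots 4,5,6", "L4-6", "L4,5,6", "L4 L5 L6", "L4 5 6", "GL 4-6"]:
        print(s, "->", normalize(s))   # every line prints the same set: {4, 5, 6}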

What this means is that our migrations are extremely complex, time-consuming and, consequently, costly. The Data Warehousing Institute has estimated that 70% of most data warehouse budgets are spent on integration and extract, transform and load (ETL).

The following traditional approach to migration, although thorough, will demonstrate how complex migrations can be.

Mapping Prerequisites

Data Profiling
The domain (or list of values occurring in each field) of each element must be documented. Data profiling is the systematic, disciplined and repeatable process of obtaining metadata about the “contour” of each field in the database – facts such as how many nulls are in the field, what percentage of the overall data this represents, how many unique values are in the field, what type of data is in the field, the top occurring patterns and how often each occurs, etc. One of the consequences of not knowing a source element’s domain in advance is that the data does not fit into the fields the first time a load is attempted. This situation results in your load bombing; it is called “load, code and explode.” Data profiling can help avoid this pitfall.
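
As a rough illustration (the column values are invented, and this is no substitute for a real profiling tool), here is a minimal Python sketch that computes a few of these "contour" facts for one field: null counts, distinct values and the top character patterns.

    from collections import Counter
    import re

    def profile(values):
        """Profile one field: nulls, distinct values, and top character patterns."""
        nulls = sum(1 for v in values if v is None or str(v).strip() == "")
        non_null = [str(v) for v in values if v is not None and str(v).strip() != ""]

        # Mask characters so "L4-6" and "L7-9" both profile as the pattern "A9-9".
        def mask(v):
            return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))

        patterns = Counter(mask(v) for v in non_null)
        return {
            "null_count": nulls,
            "null_pct": round(100.0 * nulls / len(values), 1) if values else 0.0,
            "distinct_values": len(set(non_null)),
            "top_patterns": patterns.most_common(3),
        }

    print(profile(["L4-6", "L7-9", "Lots 4,5,6", None, ""]))
    # {'null_count': 2, 'null_pct': 40.0, 'distinct_values': 3,
    #  'top_patterns': [('A9-9', 2), ('AAAA 9,9,9', 1)]}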

You will have to perform data profiling either before or after loads are attempted – guaranteed. It is much more painful to perform it after attempted loads because you are performing it as a troubleshooting endeavor. If data profiling is done in advance, you will be able to prepare for each condition properly. In addition, you can use the list of values as a quality check for complex mappings. You can ask yourself, “Have I covered all cases?”
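
The "Have I covered all cases?" check can be as simple as a set difference between the profiled domain and the values your mapping rules actually handle (the codes below are invented):

    # Domain discovered by profiling vs. values the mapping rules handle.
    profiled_domain = {"A", "B", "C", "X"}      # hypothetical observed codes
    handled_by_mapping = {"A", "B", "C"}        # codes the transform covers

    unhandled = profiled_domain - handled_by_mapping
    if unhandled:
        print("Mapping does not cover:", sorted(unhandled))   # ['X']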

I have always believed that you should never even attempt a data warehouse, integration or migration project without doing data profiling first. If you do, you are doing so at your peril.

Database Design: Verify from a Business Perspective
The database designs of both source and target must be totally understood, especially the business ramifications of one-to-many relationships.

For example, when multiple items are entered in a single field, separate identities can be “mushed.” Does each element merit its own row, or are the elements part of a set? This needs to be worked out with the business.
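
For instance, if the business decides each element does merit its own row, the transform has to promote the "mushed" values into child rows. A minimal sketch, with an invented delimiter and field names:

    def explode_multivalue(row, field, delimiter=";"):
        """Turn one row holding 'A;B;C' in a field into one child row per value."""
        values = [v.strip() for v in str(row[field]).split(delimiter) if v.strip()]
        return [dict(row, **{field: v}) for v in values]

    parcel = {"parcel_id": 101, "owner": "Smith; Jones; Lee"}
    for child in explode_multivalue(parcel, "owner"):
        print(child)
    # {'parcel_id': 101, 'owner': 'Smith'} ... one child row per owner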

When you are moving data, you want to make sure you are moving it correctly, and this means making sure that you – and the application folks – understand the model.

Business Rules and Legacy Implementation Rules (Artifacts)
The semantics or meaning of the data must be understood because this dictates which data elements go into which target fields. Business rule analysis helps uncover the underlying rules about the data, and legacy artifact “rules” or special codes surface during this ferreting process. Complex data dependencies will also be uncovered, and these serve as templates for the transforms.

Mapping Process and Pitfalls Due to Semantic Evolution
Once these prerequisites are complete, then mappings can begin. It may sound easy to simply map each source element to each target element, but anyone who has been involved in any legacy system migration knows that it is not usually very straightforward, especially if you are dealing with older systems utilizing a database paradigm other than relational.

The most common problem with older systems has been database change, which was next to impossible with network or hierarchical databases. Because these systems were so difficult to change, users became very creative about how they stored data. There were not enough fields and a new field could not be added, so users devised strategies in which one field held three or more distinct pieces of information. Special codes were created, and data was embedded in fields in creative ways. These are instances of specialized semantics invented to express data that was not understood when the system was designed. As a result, the meanings of fields often evolved over time.

Troublesome Mappings
This basic limitation of legacy systems often results in one field being split into three or four (and sometimes more!) fields. Sometimes these splits are easy, such as splitting a phone number. Although we do not normally split up phone numbers, a phone number has three separate pieces (four, if you consider international numbers with a country code). If you were to split one up, it is a simple parsing algorithm: the first three digits are the area code, the next three are the local exchange and the last four are the unique number. There are no value-based branches to take, just a simple positional parse. However, consider what must happen when area codes are not stored. They must be inferred from location data, which can be very problematic in regions with more than one area code.
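
A minimal sketch of that positional parse (North American numbers assumed), including the branch where the missing area code has to be inferred:

    import re

    def parse_phone(raw):
        """Positional parse of a North American phone number."""
        digits = re.sub(r"\D", "", raw)          # keep digits only
        if len(digits) == 10:
            return {"area_code": digits[:3],
                    "exchange":  digits[3:6],
                    "number":    digits[6:]}
        if len(digits) == 7:
            # Area code missing: it must be inferred from location data, which
            # is unreliable in regions served by more than one area code.
            return {"area_code": None,
                    "exchange":  digits[:3],
                    "number":    digits[3:]}
        return None                              # needs manual review

    print(parse_phone("(303) 555-0147"))
    # {'area_code': '303', 'exchange': '555', 'number': '0147'}
    print(parse_phone("555-0147"))
    # {'area_code': None, 'exchange': '555', 'number': '0147'}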

However, many times special codes were created to act as abbreviations because the field (which was never meant to store so much information) was not long enough. Sometimes data would even slop over into an adjacent field, and there was a special code to indicate that! Here are a few special codes that appeared in our system migration effort (a sketch of how such codes might be handled follows the list):

  • 2L meant “1/2” was added to the end of the designator field for a government lot.
  • N2 meant “north half” was specified as a partition, so a new row was created in the subdivision table. 
  • 1 at the end of the survey number field sometimes meant the land description is a duplicate (e.g., there are two different government lots with the same name).
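
Here is a rough sketch of how such artifact codes might be expanded during the transform. The codes are the ones listed above, but the field names (designator, code, survey_number) are hypothetical stand-ins for the real columns:

    def apply_special_codes(row):
        """Expand legacy artifact codes into explicit target fields (names hypothetical)."""
        out = dict(row)
        extra_rows = []

        if row.get("code") == "2L":
            # "2L" meant 1/2 was appended to the government-lot designator.
            out["designator"] = row["designator"] + " 1/2"
        elif row.get("code") == "N2":
            # "N2" meant "north half": it becomes its own row in the subdivision table.
            extra_rows.append({"parent": row["designator"], "partition": "north half"})

        if str(row.get("survey_number", "")).endswith("1"):
            # A trailing 1 sometimes flagged a duplicate land description.
            out["possible_duplicate"] = True

        return out, extra_rows

    row = {"designator": "Lot 7", "code": "2L", "survey_number": "31"}
    print(apply_special_codes(row))
    # ({'designator': 'Lot 7 1/2', 'code': '2L', 'survey_number': '31',
    #   'possible_duplicate': True}, [])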

It would be manageable if this were all we had to deal with. However, the business also had categories, and different rules applied to each category: different codes applied, different fields were mandatory, and fields had different domains.

I have coined a term for this type of situation: value-based dependencies. This means that different dependencies are active based upon a given value in a field. Sometimes it can also be the interaction of two or more fields that causes the dependency.
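
A minimal sketch of a value-based dependency: the value in a category field decides which fields are mandatory and which transform fires. The categories, field names and rules below are invented for illustration:

    # Rules keyed by the value of a category field; each category has its own
    # mandatory fields and its own transform (hypothetical categories and fields).
    rules = {
        "government_lot": {"mandatory": ["lot_number"],
                           "transform": lambda r: {"lot": int(r["lot_number"])}},
        "private_claim":  {"mandatory": ["claim_name", "acreage"],
                           "transform": lambda r: {"claim": r["claim_name"],
                                                   "acres": float(r["acreage"])}},
    }

    def map_row(row):
        rule = rules.get(row["survey_type"])
        if rule is None:
            raise ValueError("unknown category: %r" % row["survey_type"])
        missing = [f for f in rule["mandatory"] if not row.get(f)]
        if missing:
            raise ValueError("missing mandatory fields %s for %s"
                             % (missing, row["survey_type"]))
        return rule["transform"](row)

    print(map_row({"survey_type": "government_lot", "lot_number": "4"}))
    # {'lot': 4}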

The Complexity of Mappings
There are different classes of tools that get the job done, and they vary in their adequacy. You don’t really realize how complicated mappings are until you are knee-deep in them and the tool you have chosen is insufficient. Join the club – you are not alone!

I have found that for complicated situations, such as value-based dependencies, the traditional tools don’t allow the mapping to be expressed adequately. The next section surveys the different types of tools and pinpoints how complex mappings can be expressed in each, if they can be expressed at all.

Survey of Tools

ETL Tools
ETL tools are billed as “all-purpose” because they supposedly handle the three important tasks of data movement: extract, transform and load. However, I have found that they do not go far enough for complicated transformations; thus, you will have to resort to writing custom code.

Fixed Target Mapping Tools
Evoke (now part of Informatica) had a tool called FTM, or Fixed Target Mapping. It was meant to fit into an environment where you had already done your data profiling with Axio, used Axio to analyze data dependencies, and then used Axio again to generate a conglomerate model of all your source systems. After these tasks were completed, you fed the tool a fixed target model (such as the data model used by Oracle Financials) and performed a mapping from the Axio-generated source to the fixed target.

I did not use Axio in this fashion. I had performed the data profiling with a prior version of Axio called Migration Architect, and I did not perform all of the sequential tasks required to produce a data model. In my particular project, we were analyzing only one table, and dependency analysis really didn’t fit into the analysis we were performing. Thus, we had neither a system-generated source nor a “fixed” target. I was the designer of the target, which was nice: if I found a problem and a field had been left out of the target design, I could simply add it in.

FTM did the job for all of the “straightforward” field mappings (see Figure 1). The thing I liked best about FTM was that I could flip the model and view it from either perspective: source or target. When you are completing your mappings and trying to verify that you are done, one of the things you can do is “flip” your mappings. Most people design mappings from source to target, and you can then check whether you have covered all the source fields. But what about all the target fields? If your tool lets you flip the view from target to source, you can see at a glance whether all the target fields have been mapped. This is a handy feature.
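
If your tool cannot flip the view for you, the same check is easy to approximate: invert the mapping and look for target fields that nothing feeds (field names invented):

    source_fields = {"cust_nm", "addr_ln_1", "pst_cd", "phone"}
    target_fields = {"customer_name", "street_address", "postal_code", "email"}

    # Mapping viewed source-to-target.
    mapping = {"cust_nm": "customer_name",
               "addr_ln_1": "street_address",
               "pst_cd": "postal_code"}

    # Flip it to view target-to-source.
    flipped = {t: s for s, t in mapping.items()}

    print("Unmapped source fields:", source_fields - mapping.keys())   # {'phone'}
    print("Unmapped target fields:", target_fields - flipped.keys())   # {'email'}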

I used Brio to create reports. I exported the Brio report to Excel and created a spreadsheet showing the mappings. I ran into two trouble spots when performing these tasks. One was understanding Axio’s metamodel: you must be able to decipher many layers of relational design. The second involved an older release of Oracle that didn’t support multilevel simultaneous outer joins. This caused problems when you wanted all the rows in tables A, B, C and D joined together, regardless of whether they had child rows in their related tables. I had to get the job done with separate queries, which was a pain.

Figure 1: Axio FTM Screen

When Traditional Tools Don’t Cut It…
Things got really hairy when I got into the categories and the value-based dependencies. I had a difficult time expressing these mappings in FTM because the source-to-target logic varied so widely based on the data categories. That is when I resorted to other tools.

MindManager
I used MindManager for one project. One of the biggest advantages of this tool was the ability to flag troublesome transforms graphically with little symbols, which calls the programmer’s attention to the fact that “something tricky is going on with this part – pay attention!” It also lets you see visually that one field is going into multiple target fields (see Figure 2), and you can hide the complexity in a text field that appears only when you print. On the downside, it does not let you look at the transforms from either perspective (source to target, or target to source); MindManager, by its very nature, is one-sided. Granted, I was using it for something it is not normally used for; its main purpose is organizing thoughts or running an orderly brainstorming session. In any case, it is not well suited to very complicated value-based dependencies.

Figure 2: MindManager

Visio
The only tool that I found helpful for extremely complicated transforms was a drawing tool; in my case I used Visio, but any drawing tool will do. I did not use any of Visio’s special modeling capabilities; I just used the flowcharting stencil.

Flowcharting is simple, and it appears to be a fairly universal “language.” Both businesspeople and technical people can read basic flowcharts. Because in complex mappings you are diagramming a process (or usually a series of processes), flowcharts seem to work well.

What I did was split the complex transform into parts. I would try to move from the simplest case to the more complicated ones. Sometimes, if a specific process should be repeated for each “case” (what I called each step of the transform), then I would illustrate these first.

Figure 3 shows a series of cases to be performed. In our situation, transforms were based upon the interaction of data category (here called “survey types”) and legacy subsystem. A different flowchart was produced for each combination of subsystem/survey type.

Figure 3: Visio Flowchart Transform

In addition to the flowcharts, Visio was also used to display the complex field maps referenced in the flowcharts. As you can see in Figure 4, the combination of sur_nr and suffix fields determined some values, aliquot part determined the creation of a new row, and the combination of sur_nr, suffix and aliquot part also determined some values. These complexities cannot be understood with just the field map; the flowchart transforms are also required.

Figure 4: Visio Field Map

Notice in Figure 3 that each numbered section represents a semantic nugget. These diagrams simply say, “Here’s how you recognize a private land claim with no subdivisions.” Note also that these are just standalone drawings that express the semantic nuggets graphically, so everyone can agree that this is indeed how they are expressed in the system, and the programmer can then use them to write code. However, there is still another step: a programmer must actually write that code! These diagrams don’t do anything; they are not code in and of themselves (wouldn’t it be nice if they executed without any code? Keep reading!).

Map and Load is Iterative
Using traditional tools and pattern-matching approaches to migration, you have to be resigned to the fact that invariably you will not get the mappings right the first time. The first time you attempt a load and hold your breath, nine times out of 10 something will go wrong. This is totally normal. You will probably have to adjust the mappings. This is okay. But this is why your mappings should be easily understood by all parties: businesspeople, analysts and programmers. You should be able to refer to the mapping to understand what is intended, and then figure out what the problem is. If the mappings are well understood, then troubleshooting will be easier. Updating them should also be easy.

A failed load is a friend in disguise, helping you to get it right. The insidious problems are always the ones that appear to work, and then months down the road when everything is in production, the users realize that there is “something wrong with the data.”

But there may be a better way than map and load, remap and reload, ad infinitum…

Semantic Discovery
Part of the problem with the old-school tools described above is that they treat profiling, discovery and restructuring as separate tasks. Imagine if data profiling could be performed at the same time as semantic discovery, and semantic discovery could lead seamlessly into data standardization.

The other part of the problem is that the old-school methods look for simple syntactic patterns rather than semantic concepts. Imagine if you could have a tool look for all the instances of these semantic nuggets and then display only the data it could not recognize. Then you can have the subject matter expert (SME) looking over your shoulder saying, “Oh! I recognize that one – it’s another way of expressing a Texas Tract.” In essence, the tool is therefore performing data profiling and semantic discovery all in one step. Imagine all the time you will save!

Because I performed this project before any semantic discovery tools were available, I don’t have a direct comparison using the same data. However, the semantic discovery problem is universal, and it can easily be seen in the following examples how our project could have benefited from this new technology.

Semantic Discovery in Action
When you begin using a semantic discovery tool and you have it analyze a specific field, it will try to recognize semantic nuggets, as seen in the following series of screenshots (Figures 5 through 9). The first step is to read in some sample data. At the outset, the system does not recognize anything, but the good news is that it can quickly be taught to understand even complex and esoteric domains.

In this example, we have a set of product descriptions representing motors. Some of the descriptions are easy to read, some are more cryptic, some contain only the required information and some are intermixed with irrelevant text. Our task is to understand the items, extract key attributes and standardize the descriptions into a usable form. This would be a typical task in data integration and one that could easily drive someone like me completely mad – until now that is! Let’s look at how we can now tackle this with semantic discovery.

Figure 5: No Data is Recognized Yet 

The first step in understanding the sample data is to load common, foundational definitions such as units of measure. Any commonly found data elements and relationships for a given subject domain can be imported in a similar way. Notice that the system immediately understands 62% of the data, which it shows in white. What’s more, it recognizes the information in context, so that abbreviated and potentially ambiguous terms such as “HP” and “RPM” are not misunderstood. Using my old-school methods, this step alone could be a significant programming task.

Figure 6: The Beginning of Semantic Discovery

The next step is to add more intelligence or understanding to the data. Figure 7 shows the impact of adding intelligence on only the first line and applying the learning to subsequent lines – overall understanding jumps to 78%. This is because the inherent semantic patterns are highly reusable (as opposed to the syntactic patterns which are affected by word order, abbreviations, punctuation, etc., and tend to vary line by line). By simple point-and-click operations, the contextual meaning of the data can quickly be taught.

Figure 7: Assigning Meaning to Components

Now that the semantic nuggets have been identified, the data can easily be restructured into any form required.

Figure 8: Attributes Can be Extracted and Items Restructured
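
The screenshots come from a commercial tool, but the flow they illustrate can be sketched in a few lines of Python: tag each token with a semantic type drawn from a small vocabulary, measure how much of the description is recognized, and restructure the recognized pieces into named attributes. The vocabulary, functions and attribute names below are my own invention, not DataLens internals:

    import re

    # Foundational definitions: a token pattern and the semantic type it expresses.
    semantic_rules = [
        (re.compile(r"^\d+(\.\d+)?$"),        "number"),
        (re.compile(r"^(HP|KW)$", re.I),      "power_unit"),
        (re.compile(r"^RPM$", re.I),          "speed_unit"),
        (re.compile(r"^(MOTOR|MTR)$", re.I),  "item_type"),
    ]

    def tag(description):
        """Tag each token with a semantic type, or None if unrecognized."""
        tokens = description.replace(",", " ").split()
        return [(t, next((label for rx, label in semantic_rules if rx.match(t)), None))
                for t in tokens]

    def extract(tagged):
        """Restructure recognized (number, unit) pairs into named attributes."""
        attrs, recognized = {}, 0
        for i, (_token, label) in enumerate(tagged):
            if label:
                recognized += 1
            if label in ("power_unit", "speed_unit") and i > 0 and tagged[i - 1][1] == "number":
                key = "horsepower" if label == "power_unit" else "speed_rpm"
                attrs[key] = float(tagged[i - 1][0])
        return attrs, round(100.0 * recognized / len(tagged))

    tagged = tag("5 HP 1800 RPM MOTOR, TEFC FRAME")
    print(extract(tagged))
    # ({'horsepower': 5.0, 'speed_rpm': 1800.0}, 71)  -- attributes plus % recognized

The unrecognized residue ("TEFC FRAME" here) is exactly what you would show the subject matter expert, who can then teach the system what it means.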

This technique not only streamlines data discovery, it also facilitates translations of all kinds. Mappings become just another type of translation. In addition, you can use this technique to translate a cryptic string into well-ordered English that makes sense (see Figure 9). If you are an international company, you can translate from any language into any other language through these semantic nuggets.

Figure 9: Items Can Be Standardized and Translated On-the-Fly

The previous screenshots show how the system can be “taught” to recognize semantic nuggets and apply them to data transformation, but the system can also be deployed at run-time to transform data on-the-fly and automate real-time operations.

Author’s Note: The screenshots in this article are from the DataLens System by Silver Creek Systems. I’m very excited about this tool. It has enormous potential to streamline data migration tasks, cut labor costs and increase ROI by orders of magnitude. It has the potential to make my job as a consultant more effective and productive, save my clients time and money, make my clients very happy and me a heroine!

Data mapping is all about semantic discovery and translation. All along we have needed a tool which enabled us to teach it semantic components, not raw patterns. New semantic discovery tools such as Silver Creek’s DataLens System may be just the thing to supercharge data migration and integration. Semantic tools maximize the time of business analysts and SMEs. Semantic discovery is the “next big thing” for data migration!

Many thanks to Martin Boyd at Silver Creek for his assistance and edits.

  • Bonnie O'Neil

    Bonnie is President of Westridge Consulting and an internationally recognized expert on data warehousing and business rules. She is a regular speaker at the Meta Data/DAMA Conference, Software Development, Database World, Guide and the Business Rules Forum, and she was the keynote speaker at an international conference on data quality in South Africa. She is a founding member of the Guide Business Rules Group (a standards group for business rules) and of the ODTUG Business Rules Summit. She has been involved in data warehousing projects at both Fortune 500 companies and government agencies, and her expertise includes specialized skills such as data quality, profiling, data integration and migration. She is the author of two database books, including Oracle Data Warehousing Unleashed, as well as over 40 articles and technical white papers. She is a Certified GIF Architect, certified by Bill Inmon, the father of data warehousing.

 
