This article originally appeared on the BeyeNETWORK.
In my previous articles, we looked at the initial survey of all the significant automated data assets in the enterprise. These include: formal production databases; master files and transaction files of legacy systems; data on personal computers not subject to IT production disciplines; data owned by the enterprise but outsourced to service bureaus beyond the corporate premises; and data imported from external sources on a regular basis. We also looked at the technological and political barriers that must be overcome to gain read-only access and visibility to data assets.
This article looks at tables – either physical or logical – and also at what I call “table equivalents” not necessarily in RDBMS platforms. Table equivalents are sometimes revealed by record types in multi-segment files (such as the files often transferred between computers and between enterprises). When any file or data flow has multiple record formats, that file probably contains distinct subject entities related to each other in some way (which constitutes a mini-data architecture). If placed in a normalized relational database, these entities would each be mapped to separate tables. Hence, I call them “table equivalents.”
What is it necessary to know about a table? In addition to a variety of statistics and metrics, it should be determined what subject entity (entities) is (are) described by each table. A thorough analysis of the actual data in a table in a “production” database (in this sense, meaning currently active in supporting business activities and transactions) will probably reveal more than one business subtype to the primary entity described by the table. It is important to understand those subtypes.
Obviously, the first thing to know about a table is its size. That means both how many columns and how many rows it contains. The number of columns can be obtained from the directory (if in a DBMS) or otherwise from a reliable file description. The number of rows can only be accurately determined by actual observation. Hence, this requires posing a query against the table. This, of course, requires read-only access to all the tables in a database of interest and, ultimately, to all the databases in the enterprise.
Is it only the column count that is of interest? Not quite. The number of columns may not reflect the number of actual fields or data elements in a table. Some columns may contain multiple fields or sub-fields. For example, a 20-character customer number will very likely contain some sub-fields. This is why native SQL alone may not be the best query tool to use when conducting analysis. It is necessary to use a query tool or reporting language that allows for “testing under mask” or subdivision of a longer field into several shorter ones. This technique may apply to numeric fields as well as character fields
It is also necessary to determine the business meaning of the data, not just the physical presence of rows and columns. The business meaning may be more vivid in sub-fields. Therefore, the query tool needs to allow sorting and aggregation on months and years, not just on raw dates. Or, if every date is carried as a date-time (as is the case in some RDBMS implementations), it is necessary to be able to conduct analysis on just the date (with the hours, minutes or seconds truncated or masked). These techniques will be discussed in further detail in subsequent articles.
Certainly it is necessary to determine the number of rows. So, for example, assume the record count reveals that there are 45,927 rows. That is somewhat useful, but a single statistic like that lacks context. Is that good or bad? Two general questions immediately come to mind:
- Is that number of rows growing or declining, and at what rate?
- Are all the rows created equal? Of equal importance? Of equal standing? Are they all peers to each other (logically)?
Having a row count of a legacy database table is a good start to understanding how much data there is, but it would be far more useful to know how the row count has changed over time as the result of business activity. Few application designers ever give much thought to the size of the tables. A good database administrator (DBA) must do so. Some DBAs do track (often informally) how large the tables are at various points in time, and some data managers or DBAs track table size on a regular basis. Some organizations even have automated applications to monitor table size.
Managing Table Growth
Rigorous and regular tracking usually reveals certain patterns. There are tables which grow slowly – generally describing kernel entities which are “slowly changing” and also somewhat tangible entities – such as people, customers, employees and cost centers.
Then there are tables that carry transactions and business events. These generally grow at a higher rate. In large, mature enterprises, these tables would eventually grow to unmanageable size if some kind of “pruning” process did not take place. While this may have been anticipated by the original application designers, it is usually something that an experienced DBA must address, either before the application starts running or early in the life of the application and its supporting database.
Is it necessary to keep records forever? When can purging begin? How long records must be kept can be a complex business question. The purging of records (or perhaps more correctly, the archiving of them, one would hope) should be driven by business policy. The most likely decision rule is how long it is legally necessary to keep the records online (that is, to support reasonable online transactions), or perhaps how long (by business policy) before the contents of such records cannot be changed. These are two distinct questions. The latter is often difficult to discern during the design of the application, and it is often ignored. Superficial analysis often
The latter is often difficult to discern during the design of the application, and it is often ignored at that time. But as transaction files grow, and space and performance become an issue, the very act of designing a purge/archive function may force record retention questions back for further business review. Business policy (how long to keep records on the active file) may need to change to be pragmatic.
In many customer-transaction systems, keeping a history of transactions becomes important. The history contains important information such as noting when a customer has not been active and risks becoming dormant, or the history may be used to evaluate patterns of customer behavior to determine the risk of archiving that customer record (although that is best done in a data warehouse).
The employee-customer dialogue may legitimately include comments such as: “We haven’t seen you for a while” or “You have been making a lot of small withdrawals.” To support that kind of “intuitive” feel for the customer’s recent behavior, the data must be available and displayed in some viewer-friendly manner. This is one of the considerations of crafting a good purge or archive policy – and designing a user-friendly online application.
In surveying the wide variety of tables in typical databases, two patterns of table growth become visible. One is where there is active purging. Figure 1 shows an invoice master file of a mature organization, where a purging function has been imposed. Notice that in spite of the “sawtooth” pattern, the general trend is upward, showing moderate growth in business activity.
Figure 1: Table size pattern of invoice file in mature enterprise
On the other hand, as shown in Figure 2, the customer master file in the same organization shows slow, but steady growth.
Figure 2: Growth of customer master table size in mature enterprise
If that table size chart has a significant discontinuity, as shown in Figure 3, it may be necessary to understand why that is occurring.
Figure 3: Customer master file with trauma or discontinuity
In this case, there was an acquisition of a similar company. The reasons for discontinuities should be business reasons, not technical. If they are technical, then it should be determined how this affects the business.
The table name does not always reveal what is in the table or what business entities the data in the table fundamentally describes. Part of a high-level survey of any database and the tables it contains involves knowing what kind of business-subject entities each table describes. The following questions are suggested for this part of the inventory:
- What kind(s) of things do records in the table describe?
- Does the table contain/describe just one subject entity or several?
- Are all the records logical peers to each other?
- What anomalies exist?
- Are there significant business subtypes in the table?
- Are instances described by the data mutually exclusive, or are there duplicate or quasi-duplicate records?
These are important questions, and they are business-oriented questions, not technical. Technically oriented programmers may not care. They may not even think to ask these questions. The answer to these questions generally is irrelevant to how the program works, and many anomalies may not cause abends. Like nearly all other data inventory activities, this requires looking at the actual data in the table.
Mature business systems supporting large, complex enterprises tend to have several significant (or “kernel”) subject entities that, over time, develop some interesting anomalies and subtypes. Following is a list of the typical kinds of subtypes:
Customer Master File:
- Domestic vs. international customers
- Prospective vs. active vs. former/dormant/inactive customers
- Outside customers vs. internal (to the organization) customers
- Consumers (i.e., households) vs. business customers
- Corporate accounts vs. divisions (subordinate) or locations
- True, elemental customers vs. artificial customer groupings
- Non-customer placeholders
Product Master File:
- Consumer products vs. industrial-sized products
- General vs. specialty
- Company-owned brands vs. products packaged for other companies
- True, elemental products vs. product groupings
- Prospective products not ready for sale
- Products no longer offered for sale
- Non-product placeholders (supporting some kind of business process)
- Tangible products merged with services
Invoice History File:
- True original invoices vs. invoice reversals vs. credit memos
- Shipments to customers vs. interplant transfers
- Shipments resulting in revenue vs. free goods
These subtypes can be important. They have business significance, and they are important when evaluating the quality of the data in a table. Some subtypes may require unique treatment in archiving or purging. Of course, detecting these subtypes requires use of a query tool and possibly a vendor-supplied data quality tool.
In my next article, we will take a slight detour to talk about “duplicate data.” Then we will look at customer master files and the variety of subtypes that they may contain.