This article originally appeared on the BeyeNETWORK
According to cosmologists, only some 5% of the mass and energy of the universe can be accounted for by what we can observe, mainly stars and nebulae. The remaining 95% is attributed to “dark matter” and “dark energy”. Dark matter cannot be observed directly, and just what it consists of is a matter of speculation. Nevertheless, its existence is widely accepted, since its mass can be estimated from the rotational velocities of galaxies. It would seem that the cosmologists have a considerable advantage over enterprise information management (EIM) since they can quantify what they cannot see. When it comes to data, just how much of it is hidden within the enterprise is extremely difficult to ascertain. Yet we all know it is out there.
The focus of traditional data management is relational databases and the applications that interact with them. It is true that earlier generations of technology can still be found in the data landscape, such as VSAM and ISAM. However, these living fossils predate the explosion of data management that was intimately connected with the advances in relational theory and the deployment of products based on it. In any case, ISAM, VSAM, and the like can only be managed by applications that have been designed and implemented at the enterprise level. The data contained in these formats may be difficult to get at and manipulate, but its existence is known.
By contrast, there is a lot of data that is not visible to data administrators. It exists primarily in personal files whose content is managed directly by individuals rather than by any corporate applications. This is dark data.
Although little is known about the extent of dark data, it would seem that the majority of it is contained in spreadsheets. Nevertheless, Excel workbooks are not alone. There are also scanned images, Word documents, PDF files, and even PowerPoint presentations. Interestingly, none of these may be sophisticated enough for some users, and there is evidence that Access is now being used as a more powerful version of Excel.
No matter what format the dark data resides in, it exists as a shadowy world that is largely unknown to enterprise information management in particular and IT in general. There is also reason to believe that the volume and scope of dark data are growing all the time.
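As a rough illustration of how the extent of dark data might start to be gauged, the sketch below inventories files on a shared drive by the formats mentioned above. It is only a starting point under stated assumptions: the extension list, the category labels, and the function name `inventory_dark_data` are hypothetical, and a file's extension says nothing about whether its content is actually unmanaged.

```python
from collections import Counter
from pathlib import Path

# Assumed mapping of file extensions to the formats named in the text.
DARK_FORMATS = {
    ".xls": "spreadsheet", ".xlsx": "spreadsheet",
    ".doc": "document", ".docx": "document",
    ".pdf": "pdf",
    ".ppt": "presentation", ".pptx": "presentation",
    ".mdb": "access", ".accdb": "access",
    ".tif": "scanned image", ".tiff": "scanned image",
}


def inventory_dark_data(root):
    """Count candidate dark-data files under root, and their bytes, by format."""
    counts = Counter()
    sizes = Counter()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        kind = DARK_FORMATS.get(path.suffix.lower())
        if kind is None:
            continue  # not one of the formats we are looking for
        counts[kind] += 1
        sizes[kind] += path.stat().st_size
    return counts, sizes
```

Calling `inventory_dark_data` on the root of a departmental file share (the path is up to the reader) returns two counters keyed by format, which gives at least a lower bound on how much candidate dark data exists.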
The role of spreadsheets in enterprises is quite problematic. Their place in the general scheme of things dates to the personal computer revolution. Prior to the advent of PCs, users could not really process information independently, except on paper. In those days, mainframes provided functionality which was often implemented as batch processes. Where screens were available, they were typically “dumb”, green, 3270-type terminals. It is from this bygone mainframe era that we get the foundation of our current systems development methodologies. There are no categories in this framework of thought for personal data processing, and so it does not recognize dark data.
Today, the focus of data management is on shared relational databases. Responsibility for data (part of data governance), to the extent that it is encouraged at all, is only discussed in the context of these databases. The concept that users may be responsible for data they produce in spreadsheets is rarely thought of. At a higher level, there are no technologies needed to coordinate the use of spreadsheets. Thus, IT in general is not particularly concerned about them, except perhaps from the perspective of license issues.
Spreadsheets and other forms of personal data processing, therefore, lie outside the realm of what data management deals with.
Sources of Darkness
Where does the dark data that gets into spreadsheets come from? There appear to be four major sources:
- Data obtained from corporate databases (e.g., copied from screens, or saved from reports produced on screens).
- Data directly produced by users themselves. This may include master data (e.g., a user who first makes contact with a new client may store the details of the client in a spreadsheet).
- Data “scraped” from the Internet. Today, this can be just about anything – from exchange rates to telephone numbers.
- Data computed in spreadsheets. Vast numbers of financial and other models are contained in spreadsheets. I can attest to the fact that these models are sometimes used to manage assets worth billions of dollars.
There are several fairly obvious problems with dark data. A major one is data quality. Errors that can occur in transcription are unlikely to be caught. Also, a user may be unaware of quality issues in the data sources they are using. This is particularly true for sources outside the enterprise, such as Internet-based sources. It should be a principle of enterprise information management that the source of any data must be known. While this is rarely enforced for data in shared relational databases, it is often possible in a pinch to figure out what the source is. For dark data, it will be impossible if the user who produced it is not certain of the source. It is easy to see that there are potentially significant compliance issues in dark data.
There are also related problems. Does the user who captures the dark data really understand its semantics? If not, then decisions or reports based on the dark data may be problematic. What about copying private or confidential information into spreadsheets? There is often nothing to stop it.
Then, there are the computations inside spreadsheets. Additional dark data may be generated using logic that is not understood by anyone except its creator. Financial regulators in particular are becoming more aware of, and concerned about, the risk inherent in spreadsheet models.
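The principle that the source of any data must be known can be made concrete with a small sketch. The example below assumes that each spreadsheet or file is registered with one of the four sources listed earlier plus a free-text detail. The names `Origin`, `DarkDataItem`, and `missing_provenance` are hypothetical and do not refer to any existing tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Origin(Enum):
    """The four sources of dark data listed in the text."""
    CORPORATE_DATABASE = "obtained from a corporate database"
    USER_PRODUCED = "produced directly by a user"
    INTERNET = "scraped from the Internet"
    COMPUTED = "computed in a spreadsheet"


@dataclass
class DarkDataItem:
    name: str
    origin: Optional[Origin] = None  # None means the source is unknown
    source_detail: str = ""          # e.g. a report name or a URL


def missing_provenance(items):
    """Return the names of items whose source is not fully recorded."""
    return [
        item.name
        for item in items
        if item.origin is None or not item.source_detail.strip()
    ]
```

Any item returned by `missing_provenance` is one whose source could not be established later if its creator forgot it or left the enterprise.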
Producing dark data is one thing, but there is more. Most data and IT professionals are familiar with diagrams of the point-to-point interfaces that have grown up in many enterprises. These diagrams show how data moves from one database to another, by such mechanisms as: database replication; extract, transform and load (ETL); and messaging. However, there are shadowy kinds of integration that are not shown on these diagrams because they are carried out by humans and not technology.
One form of this kind of integration is where a system operator looks at the output of one application to obtain values that are then keyed into another application. In my experience, this technique is often applied to master data. Data for entities like Customer and Product mysteriously spreads through the databases of the enterprise in this way. Of course, the technique is risky. Coordination of the source and target has to be done in the head of the user performing the task. Important aspects of data quality may also be at the discretion of the user. The latencies and cycles of batch and real-time applications can add to the risk in this kind of “dark integration”. If a Customer or Product record is (or is not) manually created in an application at the wrong point in such a cycle, it may cause a problem downstream. Understanding how these cycles affect dark integration is, by definition, only going to be possible through tribal knowledge, since dark integration is a technique rarely supported by IT.
Data professionals can spend a lot of time analyzing the automated interfaces between applications. These efforts, however, will not show the contribution of dark integration to data flows. Basing decisions on such analysis, therefore, includes a certain amount of unquantifiable risk.
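To illustrate why an analysis of automated interfaces alone can mislead, the sketch below treats interfaces as edges in a directed graph and compares which systems a source can reach with and without manually re-keyed flows. The system names and both edge lists are invented for illustration. In practice the manual flows would have to be discovered by interviewing users, which is exactly the difficulty described above.

```python
from collections import defaultdict, deque


def reachable(edges, start):
    """Return every system that data from start can reach via the edges."""
    graph = defaultdict(set)
    for source, target in edges:
        graph[source].add(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


# Hypothetical interfaces: automated ones appear on the diagrams,
# manual ones are carried out by people re-keying values.
automated = [("CRM", "Warehouse"), ("Billing", "Warehouse")]
manual = [("CRM", "Billing"), ("Billing", "Shipping")]

documented = reachable(automated, "CRM")
actual = reachable(automated + manual, "CRM")
hidden = actual - documented  # systems the diagrams say CRM data never reaches
```

In this made-up landscape the diagrams suggest that CRM data reaches only the warehouse, while `hidden` contains the billing and shipping systems that it also reaches through dark integration.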
While it may be difficult to quantify the extent of dark data and dark integration, there are hints that it is large and important. I have been mystified as to how some of the enterprises in which I have worked continue to function given the state of their application and data landscapes. The only conclusion that I can come to is that there is something else going on in terms of information management that compensates for the very real limitations that exist. Somehow, the users are getting around the problems, and it can only be through dark data.
The denial, or ignorance, of EIM concerning dark data is not going to survive in the long term. Data is increasingly being recognized as both valuable and a source of problems for enterprises. The focus of data management is beginning to swing from the logical world of models to the physical world of data values. Responsibility for data is becoming ever more personal, with the responsibilities of individuals being emphasized rather than those of anonymous applications or the enterprise as a whole. Dark data will inevitably be challenged, and it will be up to EIM to tame it.