The term data lake has gained quite a few followers of late. I am explicitly not one of them. Words mean something. Phrases, especially when used in an architectural context, convey images that should ideally tell us something meaningful about the topic. So, what would you infer about the structure and the creation, management and use of data if I told you it was in a lake?
First, in case you're new to the phrase, a short explanation is in order. The term appears to date back to 2011, when it was first used by James Dixon, CTO at software vendor Pentaho. Since then, it has been promoted by people like Dan Woods of CITO Research and Edd Dumbill, vice president of strategy at consultancy Silicon Valley Data Science. Of more interest, perhaps, is data lake's growing use by various vendors in the big data space and the adoption for marketing purposes of variants like business data lake by the tandem of Capgemini and Pivotal (EMC's cloud computing and big data software spin-off) and enterprise data lake by Hadoop vendor Hortonworks.
That's the history. But what is a data lake? In the simplest summary, it is the idea that all enterprise data can and should be stored in Hadoop and accessed and used equally by all business applications. At its fullest extent, it amounts to a rip-and-replace strategy for all data warehouses, data marts and eventually even operational databases.
Dumbill suggests that in the latter stages of its evolution, all new applications will be built on the Hadoop data lake, all applications will share data there, and data governance and security processes will be applied there; only a few legacy or specialized applications will stand alone, he predicts. Other writers envisage a longer-term coexistence picture. Let's leave aside the obvious logistical and funding issues of a rip-and-replace approach and focus on the fundamentals. How exactly would one architect a data lake?
Data lake view paints the wrong picture
The phrase doesn't help much -- and that is one of my key issues with it. In a lake, all water is essentially equal: It flows about without constraints inside the lake's banks, its exact source is unidentifiable, and anybody can dip in a bucket and take some of it. But applying such characteristics to data leads to an architectural picture that is completely inappropriate for business data. So, why was the phrase chosen?
I suspect that it was to contrast with the highly structured, well-organized image we have of a data warehouse. But while we may be looking at an explosion of unstructured (or semi-structured) data, that doesn't mean we need a completely unstructured data store (i.e., a lake) for it. And, more important, we certainly should not consider taking data that has previously been carefully understood, modeled, structured and managed and pouring it into a lake of data of unknown provenance.
Rather than inventing new marketing-speak, I believe we must address how these very different types of business data can coexist and contribute to the creation of business knowledge. Although some of the concepts and requirements that drove the creation of the data warehouse architecture are no longer applicable, there is a strong and permanent need for a core set of data that defines the state of the business. Such process-mediated data demands a highly structured and regulated data store.
There is also a growing set of requirements for loosely defined and frequently changing data, which can be used to sense trends as part of an effort to anticipate the changing demands on the business. Such machine-generated data and human-sourced information demand an enormous, low-cost and agile data store. (For further details of this tri-domain information model, please refer to Chapter 6 of my new book Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data.)
A different architectural image: Standing on pillars
Although highly structured and agile data environments are very different from one another, there is a strong requirement to be able to relate them. The insights derived from either one on its own are far less useful than those derived from their combined information. I see the resulting architecture as one consisting of a number of technological pillars, each optimized for a particular need and type of processing, but all interlinked through assimilation processes and metadata (or, as I now prefer to call it, context-setting information). That is a very different image from a lake.
No metaphor is perfect. I recall when we discussed the term data warehouse back in the mid-'80s, we worried that it sounded like an unfriendly place for business users. Indeed it was, and the data mart was introduced to address that, even though the mart metaphor has its own shortcomings. However, there exists a fundamental cognitive issue when we start to use wholly inappropriate metaphors to describe the conceptual underpinnings of an architecture. The term data lake creates extensive and probably unintended cognitive dissonance. It does a disservice to those who are trying to define a new architecture for data, something we seriously need.
Data lake is a messy and mindless term. I suggest we dispose of it. Or, should I say, drain the swamp?
About the author:
Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. His current interest is in the wider field of a fully integrated business, covering informational, operational and collaborative environments. He is the founder and principal of 9sight Consulting; email him at firstname.lastname@example.org.