Data lake muddies the waters on big data management

Barry Devlin argues against using the trendy term 'data lake' to describe a big data store for structured, unstructured and semi-structured data.

The term data lake has gained quite a few followers of late. I am explicitly not one of them. Words mean something. Phrases, especially when used in an architectural context, convey images that should ideally tell us something meaningful about the topic. So, what would you infer about the structure and the creation, management and use of data if I told you it was in a lake?

First, in case you're new to the phrase, a short explanation is in order. It appears that the term dates back to 2011, when it was first used by James Dixon, CTO at software vendor Pentaho. Since then, it has been promoted by people like Dan Woods of CITO Research and Edd Dumbill, vice president of strategy at consultancy Silicon Valley Data Science. Of more interest, perhaps, is the growing use of data lake by various vendors in the big data space and the adoption, for marketing purposes, of variants like business data lake by the tandem of Capgemini and Pivotal (EMC's cloud computing and big data software spin-off) and enterprise data lake by Hadoop vendor Hortonworks.

That's the history. But what is a data lake? In the simplest summary, it is the idea that all enterprise data can and should be stored in Hadoop and accessed and used equally by all business applications. At its fullest extent, it amounts to a rip-and-replace strategy for all data warehouses, data marts and eventually even operational databases.

Dumbill suggests that in the latter stages of its evolution, all new applications will be built on the Hadoop data lake, all applications will share data there, and data governance and security processes will be applied there; only a few legacy or specialized applications will stand alone, he predicts. Other writers envisage a longer-term coexistence picture. Let's leave aside the obvious logistical and funding issues of a rip-and-replace approach and focus on the fundamentals. How exactly would one architect a data lake?

Data lake view paints the wrong picture

The phrase doesn't help much -- and that is one of my key issues with it. In a lake, all water is essentially equal: It flows about without constraints inside the lake's banks, its exact source is unidentifiable, and anybody can dip in a bucket and take some of it. But applying such characteristics to data leads to an architectural picture that is completely inappropriate for business data. So, why was the phrase chosen?

I suspect that it was to contrast with the highly structured, well-organized image we have of a data warehouse. But while we may be looking at an explosion of unstructured (or semi-structured) data, that doesn't mean we need a completely unstructured data store (i.e., a lake) for it. And, more important, we certainly should not consider taking data that has previously been carefully understood, modeled, structured and managed and pouring it into a lake of data of unknown provenance.

Rather than inventing new marketing-speak, I believe we must address how these very different types of business data can coexist and contribute to the creation of business knowledge. Although some of the concepts and requirements that drove the creation of the data warehouse architecture are no longer applicable, there is a strong and permanent need for a core set of data that defines the state of the business. Such process-mediated data demands a highly structured and regulated data store.

There is also a growing set of requirements for loosely defined and frequently changing data, which can be used to sense trends as part of an effort to anticipate the changing demands on the business. Such machine-generated data and human-sourced information demand an enormous, low-cost and agile data store. (For further details of this tri-domain information model, please refer to Chapter 6 of my new book Business unIntelligence: Insight and Innovation Beyond Analytics and Big Data.)

A different architectural image: Standing on pillars

Although highly structured and agile data environments are very different, there is a strong requirement to be able to relate them to one another. The insights derived from either one on its own are far less useful than those derived from their combined information. I see the resulting architecture as one consisting of a number of technological pillars, each optimized for a particular need and type of processing, but all interlinked through assimilation processes and metadata (or, as I now prefer to call it, context-setting information). That is a very different image from a lake.
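To make the pillar image a little more concrete, here is a minimal Python sketch of how context-setting information might tie separate stores together. It is an illustration only, not a description of any product or of the architecture in my book: the class names (Dataset, ContextCatalog), the fields, and the sample data sets are all hypothetical, and the "assimilation" step is reduced to matching shared business keys.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """One data set plus its context-setting information (hypothetical fields)."""
    name: str
    pillar: str       # which store holds it, e.g. "relational" or "hadoop"
    domain: str       # "process-mediated", "machine-generated" or "human-sourced"
    source: str       # provenance: where the data came from
    keys: frozenset   # business keys that can relate this data set to others


@dataclass
class ContextCatalog:
    """Shared context-setting information that links the pillars (assumed design)."""
    datasets: dict = field(default_factory=dict)

    def register(self, ds: Dataset) -> None:
        # Unlike a lake, nothing is admitted without a known source.
        if not ds.source:
            raise ValueError(f"{ds.name}: provenance is required")
        self.datasets[ds.name] = ds

    def related(self, name: str) -> list:
        """Names of data sets in other pillars that share a business key."""
        target = self.datasets[name]
        return sorted(
            other.name
            for other in self.datasets.values()
            if other.pillar != target.pillar and other.keys & target.keys
        )


catalog = ContextCatalog()
catalog.register(Dataset("orders", "relational", "process-mediated",
                         "order entry system", frozenset({"customer_id"})))
catalog.register(Dataset("clickstream", "hadoop", "machine-generated",
                         "web server logs", frozenset({"customer_id", "session_id"})))
catalog.register(Dataset("support_emails", "hadoop", "human-sourced",
                         "help desk inbox", frozenset({"customer_id"})))

print(catalog.related("orders"))  # ['clickstream', 'support_emails']
```

The point of the sketch is the contrast with a lake: each data set stays in the store suited to it, its source is recorded before it is accepted, and the catalog is what lets a structured data set be related to loosely structured ones in another pillar.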

No metaphor is perfect. I recall when we discussed the term data warehouse back in the mid-'80s, we worried that it sounded like an unfriendly place for business users. Indeed it was, and the data mart was introduced to address that, even though the mart metaphor has its own shortcomings. However, there exists a fundamental cognitive issue when we start to use wholly inappropriate metaphors to describe the conceptual underpinnings of an architecture. The term data lake creates extensive and probably unintended cognitive dissonance. It does a disservice to those who are trying to define a new architecture for data, something we seriously need.

Data lake is a messy and mindless term. I suggest we dispose of it. Or, should I say, drain the swamp?

About the author:
Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. His current interest is in the wider field of a fully integrated business, covering informational, operational and collaborative environments. He is the founder and principal of 9sight Consulting; email him at barry@9sight.com.

Email us at editor@searchbusinessanalytics.com, and follow us on Twitter: @BizAnalyticsTT.

This was last published in April 2014

Join the conversation

5 comments

Great article, Barry - and from a number of different perspectives - but I'd like to propose moving beyond the building architecture (warehouse, marts and pillars) to something of an entirely different shape.

The components of your architecture work much more efficiently in the shape of a torus (donut) with the 'pillars' going from a small user-occupied pod in the middle to the outside ring, kind of like a space station. The user has equal access to information in multiple directions and the shape lends itself to much more dynamic and adaptive storage strategies. (The dual circumferences need less surface area relative to cubes, etc., to increase capacity.)

Chuck the lake.

Thank you for your suggestion jogorman (and apologies for the late reply... I wasn't notified of your post).

I like the picture you suggest, but suspect that my artistic abilities won't stretch that far. And, if I recall, in a spaceship of the structure you suggest, the people reside in the torus rather than the center :-)

Thanks for agreeing about the lake image.

Hi Barry. Interesting article. I refute that the Data Lake analogy is messy and mindless. It took a lot of careful thought when I created it. How it has been used since then is beyond my control. Obviously some of that usage has "polluted" the term in your mind, to the extent that your impression of even its basic meaning is incorrect. http://jamesdixon.wordpress.com/2014/09/25/data-lakes-revisited

What do you suggest we call it?

Great article - how about the term "data pool" - like a swimming pool - it has fast lanes of structured operational line-of-business data with boundaries between them - the floating buoyancy lines - different lines of business swim at different speeds, but toward the same goal, in the same direction, toward the finish line

Then in one corner of the pool is the kiddie training area - the shallow water - with a small volume of experimental data: unorganized patterns and raw, uncleansed, unstructured data, where the kiddie swimmers learn new skills on those new data formats.