In building out its Internet of Things, is HDS acquiring a data refinery, a data lake or a data swamp? See also Part 1
The Data Lake has been filling up nicely since its 2010 introduction by James Dixon, with a number of vendors and analysts sailing forth on the concept. Its precise, architectural meaning has proven somewhat fluid, to continue the metaphor. I criticized it in an article in April last, struggling to find a firm basis for discussion of a concept that is so architecturally vague that it has already spawned multiple interpretations. Dixon commented in a September blog that I was mistaken and set forth that: “A single data lake houses data from one source. You can have multiple lakes, but that does not equal a data mart or data warehouse” and “A Data Lake is not a data warehouse housed in Hadoop. If you store data from many systems and join across them, you have a Water Garden, not a Data Lake.” This doesn’t clarify much for me, especially when read in conjunction with Dixon’s response to one of his commenters: “The fact that [Booz Allen Hamilton] are putting data from multiple data sources into what they call a ‘Data Lake’ is a minor change to the original definition.”
This “minor change” is actually one of the major problems I see from a data management viewpoint, and Dixon admits as much in his next couple of sentences. “But it leads to confusion about the model because not all of the data is necessarily equal when you do that, and metadata becomes much more of an issue. In practice these conceptual differences won’t make much, if any, impact when it comes to the implementation. If you have two data sources your architecture, technology, and capabilities probably won’t differ much whether you consider it to be one data lake or two.” In my opinion, this is the sort of weak-as-water architectural thinking about data that can drown implementers very quickly indeed. Apply it to the data swamp that is the Internet of Things, and I am convinced that you will end up on the Titanic. Given the obvious focus of HDS on the IoT, alarm bells are already ringing loudly indeed.
But there’s more. Recently, Dixon has gone further, suggesting that the Data Lake could become the foundation of a cleverly named “Union of the State”: a complete history of every event and change in data in every application running in the business, an “Enterprise Time Machine” that can recreate on demand the entire state of the business at any instant of the past. In my view, this concept has many philosophical misunderstandings, business misconceptions, and technical impracticalities. (For a much more comprehensive and compelling discussion of temporal data, I recommend Tom Johnston’s “Managing Time in Relational Databases: How to Design, Update and Query Temporal Data”, which actually applies far beyond relational databases.) However, within the context of the HDS acquisition, my concern is how to store, never mind manage, the entire historical data record of even that subset of the Internet of Things that would be of interest to Hitachi or one of its customers. To me, this would truly result in a data quagmire of unimaginable proportions and projects of such size and complexity that would dwarf even the worst data warehouse or ERP project disasters we have seen.
To me, the Data Lake concept is vaguely defined and dangerous. I can accept its validity as a holding pond for the vast quantities of data that pour into the enterprise in vast quantities at high speed, with ill-defined and changeable structures, and often dubious quality. For immediate analysis and quick, but possibly dirty, decisions, a Data Lake could be ideal. Unfortunately, common perceptions of the Data Lake are that, in the longer term, all of the data in the organization could reside there in its original form and structure. This is, in my view, and in the view of Gartner analysts and Michael Stonebraker, to name but a few, not only dangerous in terms of data quality but a major retrograde step for all aspects of data management and governance.
Dixon says of my original criticism “Barry Devlin is welcome to fight a battle against the term ‘Data Lake’. Good luck to him. But if he doesn’t like it he should come up with a better idea.” I fully agree, tilting at well-established windmills is pointless. And as we discovered in our last EMA/9sight Big Data survey (available soon, see preview presentation from January), Data Lake implementations, however variously defined, are already widespread. I believe I have come up with a better idea, too, in the IDEAL and REAL information architectures, defined in depth in my book, Business unIntelligence.
To close on the HDS acquisition of Pentaho, I believe it represents a good deal for both companies. Pentaho gets access to a market and investment stream that can drive and enhance its products and business. And, IoT is big business. HDS gets a powerful set of tools that complement its IoT direction. Together, the two companies should have the energy and resources to clean up the architectural anomalies and market misunderstandings of the Data Lake by formally defining the boundaries and describing the structures required for comprehensive data management and governance.