
So, how do you eat a Hadoop Elephant?

With MapR’s recent announcement of $110 million in funding, following on from Hortonworks’ $100 million and Cloudera’s $900 million, both in March, debate is rife about their different approaches to the market and, of course, which of the big three will eventually win out. Throw in some fear, uncertainty and doubt about the future of the current big data warehouse vendors, a plethora of other players with varying offerings, and you have the food for a real media feeding frenzy.

No doubt the market is undergoing some significant changes and there will be winners and losers. Of course, vendor funding and marketing momentum do make a difference. Certainly, the flood of data from previously untapped or even nonexistent sources expands what businesses can hope to achieve.
But, amid all the excitement, one reality remains constant. One not-so-sexy topic—or actually a related set of topics—will drive the success or failure of real-world implementations. The same topic has been at the heart of data warehousing for nearly thirty years. And whether we call it data warehouse, data lake or data hub, or whether we build it on a relational database or an elephant’s back, is largely irrelevant. This oft-overlooked topic is information (or data) management… using the term in its broadest sense.

Since the earliest days of data warehousing, a significant tension has existed between the urge to deliver early business value and the need to ensure the integrity of the underlying data. Believe it or not, business users were as excited in the 1980s about the opportunities offered by relational databases as today’s users are about big data technologies. The underlying message is not that much different: drive better decision making based on more and better data. The challenge was then—and remains now—how to unlock the value embedded in information that was not designed, built or integrated for that purpose. In fact, today’s problem is even bigger. The data in question in business intelligence was at least owned and designed by someone in the business; big data comes from external sources of often unknown provenance, limited explicit definitions and rapidly changing structures.

For old-timers like me, the open source, big data environment is very reminiscent of the early days of relational databases in the 1980s and data warehousing in the 1990s. The focus is on improving the technological underpinnings, component by individual component. A better database optimizer. Faster throughput load and update (ETL). Security and authentication tools. Moving from batch to interactive and eventually near real-time use.

In data warehousing, the focus has long shifted to the overall process of ensuring data quality and consistency, from modeling business requirements all the way through to production delivery and ongoing maintenance. We see this in tools such as WhereScape and Kalido, which have emerged from teams who had to build and support real, ongoing and changing business needs. Once the excitement of delivering the first data warehouse, lake or hub wears off, the real challenge becomes apparent: how to keep it going in the face of ever-changing and increasingly urgent business demands.

So, how do you eat the Hadoop elephant? In exactly the same way as we’ve eaten relational databases, data warehouses and business intelligence: by lining up the pieces, defining processes and methodologies for governance, and automating and operationalizing the myriad steps as far as possible. It is precisely this long and seemingly tedious process that is largely missing today from the Hadoop world. Its absence is unsurprising; this is a market still in the first flush of delivering discrete helpings of business value.

But, in the long run (and it will be long), this is where the worlds of data warehousing and big data will converge. The knowledge and tooling of information management from data warehousing will be applied to big data. The roles of both relational databases and non-relational techniques will become clearly complementary. A hybrid architecture as outlined in my book, Business unIntelligence, will become the preferred approach. And maybe we’ll discover that the elephant we need to eat is that of information meaning and management rather than the basic data manipulation we see in Hadoop today.



Excellent post, Barry, with just one additional point. During most of those three decades a great deal of R&D has taken place which, in conjunction with improvements in underlying infrastructure, allows for improved efficiency, including automation of adaptive data management. The process is much the same as you describe, although the tools are changing: it's just automated for real-time streaming and translated to natural language for all knowledge workers. This also allows for continuous advanced analytics rather than waiting for analysts, data scientists, or software integrators.

As Vint Cerf was recently quoted in the WSJ: "governments and corporations are finally figuring out how important adaptability is. AI and natural language processing may well make the Internet far more useful than it is today."

Vint was one of the first external people I shared my AI systems patent application with, and despite a half dozen professors having seen it before him, he's the one who found the misspelled word, so we kept it as a reminder.

While we can't reveal uncovered work of course, you might find Kyield interesting. It has been a long haul indeed, but we're now working in the early stages with some of the largest and most complex enterprise network environments. KR, MM

Thanks, Mark. Will certainly take a look at Kyield... Barry