With MapR’s recent announcement of $110 million in funding, following on from Hortonworks’ $100 million and Cloudera’s $900 million, both in March, debate is rife about their different approaches to the market and, of course, which of these big three will eventually win out. Throw in some fear, uncertainty and doubt about the future of the current big data warehouse vendors and a plethora of other players with varying offerings, and you have the food for a real media feeding frenzy.
No doubt the market is undergoing some significant changes and there will be winners and losers. Of course, vendor funding and marketing momentum do make a difference. Certainly, the flood of data from previously untapped or even nonexistent sources expands what businesses can hope to achieve.
But, amid all the excitement, one reality remains constant. One not-so-sexy topic—or actually a related set of topics—will drive the success or failure of real-world implementations. The same topic has been at the heart of data warehousing for nearly thirty years. And whether we call it data warehouse, data lake or data hub, or whether we build it on a relational database or an elephant’s back, is largely irrelevant. This oft-overlooked topic is information (or data) management… using the term in its broadest sense.
Since the earliest days of data warehousing, a significant tension has existed between the urge to deliver early business value and the need to ensure the integrity of the underlying data. Believe it or not, business users were as excited in the 1980s about the opportunities offered by relational databases as today’s users are about big data technologies. The underlying message is not that much different: drive better decision making based on more and better data. The challenge was then—and remains now—how to unlock the value embedded in information that was not designed, built or integrated for that purpose. In fact, today’s problem is even bigger. The data in question in business intelligence was at least owned and designed by someone in the business; big data comes from external sources of often unknown provenance, limited explicit definitions and rapidly changing structures.
For old-timers like me, the open source, big data environment is very reminiscent of the early days of relational databases in the 1980s and data warehousing in the 1990s. The focus is on improving the technological underpinnings, component by individual component. A better database optimizer. Faster load and update throughput (ETL). Security and authentication tools. Moving from batch to interactive and eventually near real-time use.
In data warehousing, the focus has long since shifted to the overall process of ensuring data quality and consistency, from modeling business requirements all the way through to production delivery and ongoing maintenance. We see this in tools such as WhereScape and Kalido, which have emerged from teams who had to build and support real, ongoing and changing business needs. Once the excitement of delivering the first data warehouse, lake or hub wears off, the real challenge becomes apparent: how to keep it going in the face of ever-changing and increasingly urgent business demands.
So, how do you eat the Hadoop elephant? In exactly the same way as we’ve eaten relational databases, data warehouses and business intelligence: by lining up the pieces, defining processes and methodologies for governance, and automating and operationalizing the myriad steps as far as possible. It is precisely this long and seemingly tedious process that is largely missing today from the Hadoop world. Its absence is unsurprising; this is a market still in the first flush of delivering discrete helpings of business value.
But, in the long run (and it will be long), this is where the worlds of data warehousing and big data will converge. The knowledge and tooling of information management from data warehousing will be applied to big data. The roles of both relational databases and non-relational techniques will become clearly complementary. A hybrid architecture as outlined in my book, Business unIntelligence, will become the preferred approach. And maybe we’ll discover that the elephant we need to eat is that of information meaning and management rather than the basic data manipulation we see in Hadoop today.