Although the yellow elephant continues to trample all over the world of Information Management, it is becoming increasingly difficult to say where more traditional technologies end and Hadoop begins.
The excellent presentation by Actian’s (@ActianCorp) John @santaferraro and @emmakmcgrattan at the #BBBT on 24 June emphasized again, if such emphasis were needed, that the boundaries of the Hadoop world are becoming very ill-defined indeed, as more traditional engines are adapted to run on or in the Hadoop cluster. The Actian Analytics Platform – Hadoop SQL Edition embeds the company’s existing X100 / Vectorwise SQL engine directly in the nodes of the Hadoop environment. The approach brings to Hadoop the full range of SQL support previously available in Vectorwise, and Actian claims a 4 to 30 times speed improvement over Cloudera Impala on a subset of the TPC-DS benchmark queries.
Equally interesting architecturally, as shown in the accompanying figure, is the X100 engine’s creation and use of column-based, binary, compressed vector files for improved performance, and the subsequent replication of these files by the Hadoop system. These files also support co-location of data for joins, for a further performance boost.
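To make the co-location idea concrete, here is a minimal Python sketch of the general technique, not of Actian’s implementation. It assumes a hypothetical three-node cluster and made-up function names (`node_for`, `partition`, `local_join`, `colocated_join`): rows from both tables are placed on nodes by hashing the join key, so matching rows always land on the same node and each node can join its own data without moving rows across the network.

```python
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size, chosen only for illustration


def node_for(key, num_nodes=NUM_NODES):
    """Map a join key to a node; equal keys always map to the same node."""
    return hash(key) % num_nodes


def partition(rows, key_index, num_nodes=NUM_NODES):
    """Distribute rows across nodes by hashing the join-key column."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_index], num_nodes)].append(row)
    return nodes


def local_join(left_rows, right_rows, left_key, right_key):
    """Hash join of two row lists that already live on the same node."""
    index = defaultdict(list)
    for row in right_rows:
        index[row[right_key]].append(row)
    return [l + r for l in left_rows for r in index.get(l[left_key], [])]


def colocated_join(left, right, left_key=0, right_key=0, num_nodes=NUM_NODES):
    """Join two tables node by node; no row ever leaves its node."""
    left_parts = partition(left, left_key, num_nodes)
    right_parts = partition(right, right_key, num_nodes)
    result = []
    for node in range(num_nodes):
        result.extend(
            local_join(
                left_parts.get(node, []),
                right_parts.get(node, []),
                left_key,
                right_key,
            )
        )
    return result


customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, "book"), (3, "pen"), (3, "ink"), (4, "cup")]
joined = sorted(colocated_join(customers, orders))
```

Because both tables are partitioned with the same function on the same key, the per-node joins together produce the same result as a single global join. A real engine would do this over compressed column files rather than Python tuples.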
This is, of course, the type of integration one would expect from seasoned database developers when they migrate to a new platform. Actian is not alone in doing this. Pivotal’s HAWQ has Greenplum technology embedded. It would be surprising if IBM’s on-Hadoop Big SQL offering were not based on DB2 knowledge at the very least. These are the types of development that YARN facilitates in version two of Hadoop. Debate will rage about how deeply integrated the technologies are and how far they take advantage of the Hadoop infrastructure. But those are just details.
The real point is that the mix and match of functionality and data seen here emphasizes the conundrum I posed at the top of the blog. Where does Hadoop end? And where does “NoHadoop” (well, if we can have NoSQL…) begin? What does this all mean for the evolution of Information Management technology over the coming few years?
As the title suggests, I believe that we are on the crest of the third wave of Hadoop. As in Alvin Toffler’s prescient 1980 book of the same name, this third wave of Hadoop could also be claimed to be post-industrial in nature. Let’s look at the three waves in context.
The first wave of Hadoop was the fertile soil of the Internet in which the cute yellow elephant would grow. The technical pioneers of the Web, particularly Google, defined and built bespoke versions of the new data management (in a loose sense of the term) ecosystem that was needed for the novel types and enormous volumes of data they were handling. Their choice of parallelized commodity hardware and software was the foundation for and driving force of the second wave.
The second wave industrialized the approach through the open source software movement. Here we saw the proliferation of Apache projects and the emergence of commercial, independent distros from the likes of Cloudera and Hortonworks. The ecosystem gradually moved from custom code built by expert developers to a parallel programming environment with a plethora of utilities to aid development, deployment and use. This wave is now receding as it has become clear that an integrated, managed and database-centric environment is needed. Such a development is fully expected: we had exactly the same cycle in mainframes in the ’60s and ’70s and in distributed computing in the ’80s and ’90s. However, there is an important difference to consider now as the third wave of Hadoop breaks: we are no longer on a virgin shore.
The third wave of Hadoop is seeing the devaluing of the file system in favor of databases that run on top of it. Individual programs are being displaced by systems to manage resource allocation, ensure transaction integrity and provide security. While the companies and individuals who drove the second wave do recognize this shift and are developing systems such as Impala, Falcon, Sentry and more, they start from a disadvantage. The database and other system management technologies that were developed in the mainframe and distributed environments are far more robust and can be migrated to the new commodity hardware and software platform. Commercially, the vendors of these tools have no choice but to move into this market. And they are doing so. YARN has begun to unlock Hadoop from its programming origins.
I suggest that the unique strength of the Hadoop world comes not from its open source software base but from its hardware foundation of parallel commodity machines. Such hardware drives down the capital cost of playing in the big data arena. On the other hand, it increases the operational cost and management complexity. These latter aspects will militate against the open source, let-a-thousand-flowers-bloom approach that is currently being pursued; we need a data management infrastructure, including a fully functional relational database, in this environment far more than yet another NoSQL (or YANS, for short?). Realistically, such mission-critical software is more likely to come from traditional vendors, adapted from existing products, patents and skills. In this, Actian and others are showing the way.
In this third wave, of course, a new model for funding must emerge. Traditional, and often exorbitant, software pricing models cannot survive. On the other hand, the open source free-software-paid-maintenance model, while offering much innovation, is unlikely to be able to fund the dedicated, on-going development required for robust, reliable and secure infrastructure. Are any of the big players in the merging Hadoop market of this third, post-industrial wave willing to step up to this challenge?
Pictures courtesy (1) Actian; (2) Bhajju Shyam, The London Jungle Book.