Last year, Hadoop generally was still viewed as a giant, low-cost data processing pump for refining multistructured data and delivering it to the data warehouse. But it's time to take a new look because Hadoop 2, and in particular its new YARN cluster resource manager, changes everything.
Released in October 2013, Hadoop 2 turns the open source distributed processing platform into a multipurpose operating system for big data applications. Rather than supporting just one type of data processing, Hadoop 2 supports any data processing application written to the YARN interface. As such, it can support not only batch processing (i.e., MapReduce applications) but also real-time queries, enterprise search apps, stream processing, in-memory computing, and whatever else anyone dreams up and writes to YARN.
The upshot is revolutionary: Rather than move data to a variety of specialized applications and systems for processing, companies can store the data in Hadoop 2 systems and process it there as well.
That message was trumpeted recently at an analyst day hosted by Cloudera, which was the first vendor to commercialize a Hadoop distribution and related support services. In his opening remarks, Cloudera CEO Tom Reilly said that Hadoop 2 will change how companies architect analytics systems: "Rather than move data to compute resources, companies will move compute resources to data, saving enormous amounts of time and money."
Data lake harbors new types of apps
The new version has given rise to the notion of a Hadoop-based data lake. Cloudera is also one of the first companies to commercialize a data lake offering, which it calls an enterprise data hub (EDH). With an annual subscription, Cloudera Enterprise Data Hub Edition customers can access core Hadoop plus six premium components (or data processing engines), including MapReduce for batch processing, Cloudera's Impala query engine for SQL analytics, Solr for enterprise search, Spark for machine learning applications, Spark Streaming for stream processing and HBase for operational processing. Cloudera says a raft of third-party applications are also on the way.
There are some perils lurking in the Hadoop 2 data lake -- see my blog post about them. But according to Reilly, the lake spawns a new breed of "converged applications" that can deliver enormous business value. For instance, a company can use Spark Streaming to stream data from a sensor network into Spark's in-memory processing engine, where it is analyzed and turned into a model that gets embedded in a high-volume Web application running on HBase. All the while, the data never leaves the Hadoop cluster, which greatly simplifies data processing and reduces costs.
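To make the shape of such a converged application concrete, here is a minimal sketch in plain Python. It does not use the Spark Streaming or HBase APIs; every name in it (`micro_batches`, `fit_threshold_model`, `is_anomaly`, `run_pipeline`, `serving_store`) is a hypothetical stand-in, and the "model" is a deliberately simple threshold rule chosen only for illustration. The point is the flow: readings arrive in small batches, accumulate in memory, get turned into a model, and the model is published to the store that the serving application reads, all without the data leaving one system.

```python
from statistics import mean, pstdev


def micro_batches(readings, batch_size):
    """Yield fixed-size chunks, standing in for a stream's micro-batches."""
    for start in range(0, len(readings), batch_size):
        yield readings[start:start + batch_size]


def fit_threshold_model(values, k=3.0):
    """Fit a toy anomaly model: readings more than k standard
    deviations from the mean count as anomalies."""
    return {"center": mean(values), "limit": k * pstdev(values)}


def is_anomaly(model, value):
    """Score one reading against the model, as a serving app might."""
    return abs(value - model["center"]) > model["limit"]


def run_pipeline(readings, batch_size=4):
    in_memory = []      # stand-in for the in-memory analysis layer
    serving_store = {}  # stand-in for the operational store the app reads
    for batch in micro_batches(readings, batch_size):
        in_memory.extend(batch)
        # Refit on everything seen so far and publish the latest model.
        serving_store["model"] = fit_threshold_model(in_memory)
    return serving_store


# Hypothetical sensor readings, invented for this example.
store = run_pipeline([20.0, 21.0, 19.0, 20.0, 22.0, 18.0, 20.0, 20.0])
```

A Web request handler would then call `is_anomaly(store["model"], reading)` for each incoming value. In a real deployment the three stand-ins would be separate engines sharing one cluster, which is the arrangement Reilly describes.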
Although many skeptics claim that Hadoop isn't ready to support enterprise-caliber production applications, Cloudera says demand for its EDH is high. In fact, the company reportedly sold eight subscriptions within six weeks at the end of this year's first quarter after making the Enterprise Data Hub Edition commercially available.
Hadoop deployments: Evolution, not revolution
However, most companies are adopting Hadoop gradually, said Amr Awadallah, Cloudera's co-founder and CTO. Their initial motivation, he said, is to improve operational efficiency: They want to reduce the cost of storing large volumes of data; accelerate extract, transform and load (ETL) processes that are being squeezed by shrinking batch windows; or optimize the performance of a data warehouse by offloading ETL workloads or moving unused data to archival storage.
After organizations squeeze the cost efficiencies from their data architectures, Awadallah said, they implement Hadoop more strategically to deliver greater business value. At first, they might use Hadoop to give business analysts, data scientists and line-of-business workers quicker access to data to help solve pressing business problems. Rather than wait for the IT department to move data from Hadoop into a data warehouse or other downstream systems, end users query data directly in Hadoop using SQL-like data access and analytics tools.
Once the users are comfortable doing that, Awadallah said, many organizations move on to consolidating Hadoop clusters into a data lake and implementing YARN-compliant engines so they can build the kind of converged applications described above.
Safe steps toward Hadoop systems
David McJannet, vice president of marketing at Hortonworks, Cloudera's closest rival, reinforces Awadallah's depiction of the Hadoop journey. He said most companies go through several stages, from denial to acceptance, when they're confronted with evidence that Hadoop storage is 30 times to 50 times cheaper than keeping data in traditional systems.
And rather than take a bold leap into the unknown with a startup company, Hortonworks customers usually recruit a trusted partner from the commercial world to help them navigate the new terrain and blend the new IT world with the old, McJannet said. This evolutionary approach to implementing Hadoop is the centerpiece of Hortonworks' strategy.
But McJannet added that most Hortonworks customers are using Hadoop to support new applications with multistructured data, not to achieve operational efficiencies. "About 70% of our deals are for net-new applications, and 30% focus on data warehousing optimization," he said.
In my recent discussions with Hadoop vendors, including Cloudera, Hortonworks and MapR Technologies, all of them said they have seen a rapid uptick in inquiries and deals since last fall, when Hadoop 2 was released. If those claims are true, it's likely that some leading-edge customers are quickly moving beyond the tire-kicking stage and into production with Hadoop systems. If so, 2014 could be the year in which Hadoop goes mainstream -- and really starts shaking up the data management and analytics marketplace.
About the author:
Wayne Eckerson is principal consultant at Eckerson Group, a consulting firm that helps business leaders use data and technology to drive better insights and actions. His team provides information and advice on business intelligence, analytics, performance management, data governance, data warehousing and big data. Email him at email@example.com.