During a recent conversation with a client's project team, one member shared an internal slide deck used to promote the benefits of big data (in general) and Hadoop (in particular) among both key management decision makers and the development and implementation groups in IT. One interesting aspect of the presentation was the comparison of Hadoop to earlier computing ecosystems and the casting of the open source distributed processing framework in the role of "operating system" for a big data environment.
At the time the slide deck was assembled, that characterization was something of a stretch. The core components of the initial Hadoop release were the Hadoop Distributed File System (HDFS) for storing and managing data and an implementation of the MapReduce programming model. The latter included application programming interfaces, runtime support for processing MapReduce jobs and an execution environment that provided the infrastructure for allocating resources in Hadoop clusters and then scheduling and monitoring jobs.
While those components acted as proxies for aspects of an operating system, the framework's processing capabilities were limited by its architecture, with the JobTracker resource manager and the application logic and data processing layers all combined in MapReduce.
So what did that mean for running business intelligence and analytics applications? It was a significant handicap: Although the task scheduling capabilities allowed for parallel execution of the tasks in a MapReduce application, typically only one batch job could execute at a time. That basically prevented the interleaving of different types of analysis in Hadoop systems. Batch analytics applications would have to run on a different set of cluster nodes from a front-end query engine accessing data in HDFS.
Static approach holds down Hadoop processing
In addition, resource allocation was effectively static -- nodes assigned as Reduce nodes might sit idle during an application's Map phase, with the reverse happening during the Reduce process. As a result, nodes that might have been used for real-time processing were unavailable.
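The cost of static allocation can be illustrated with some simple arithmetic. The sketch below is only an illustration: the slot counts are hypothetical, and it assumes the job has enough tasks to fill every slot it is allowed to use in each phase.

```python
def static_utilization(map_slots, reduce_slots, phase):
    """Fraction of slots doing work when each slot is fixed to one task type."""
    busy = map_slots if phase == "map" else reduce_slots
    return busy / (map_slots + reduce_slots)


def pooled_utilization(total_slots, demand):
    """Fraction of slots doing work when any slot can run any task."""
    return min(demand, total_slots) / total_slots


# Hypothetical cluster: 60 slots reserved for Map tasks, 40 for Reduce tasks.
map_phase_static = static_utilization(60, 40, "map")        # Reduce slots sit idle
reduce_phase_static = static_utilization(60, 40, "reduce")  # Map slots sit idle

# Same 100 slots treated as one pool, with at least 100 tasks waiting.
map_phase_pooled = pooled_utilization(100, demand=100)

print(map_phase_static, reduce_phase_static, map_phase_pooled)
```

Under these assumptions, 40% of the cluster is idle during the Map phase and 60% during the Reduce phase, capacity that a pooled scheme could hand to other work.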
Lastly, the serial scheduling of batch execution jobs in a cluster supported neither MapReduce multitasking nor running MapReduce applications simultaneously with ones developed using other programming models. Again, that affected the ability of Hadoop users to engage in any kind of ad hoc querying or real-time data analysis.
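To see why serial scheduling hurts ad hoc querying, consider a deliberately simplified model in which the cluster runs one job at a time in submission order. The job names and durations below are invented for illustration.

```python
from itertools import accumulate

# Hypothetical jobs in submission order: (name, run time in minutes).
queue = [("nightly_etl", 90), ("weekly_report", 45), ("ad_hoc_query", 2)]

names = [name for name, _ in queue]
durations = [minutes for _, minutes in queue]

# With strictly serial execution, each job finishes only after
# every job ahead of it in the queue has finished.
finish = dict(zip(names, accumulate(durations)))

for name in names:
    print(f"{name}: done after {finish[name]} minutes")
```

In this model, a query that needs two minutes of processing returns after 137 minutes, because it waits behind both batch jobs.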
But when you review the details of the new Hadoop 2 release, you'll see that some of the JobTracker's responsibilities have been split off from MapReduce, a move that's intended to relieve some of the constraints inherent in Hadoop's initial development and execution architecture. That's good news for organizations looking to run analytical applications on Hadoop systems.
The primary idea behind YARN, one of Hadoop 2's key additions, is to divorce resource management from application management. Instead of relying on MapReduce for both scheduling and processing jobs, those tasks now are handled by separate components.
YARN -- short, in good-humored fashion, for Yet Another Resource Negotiator -- includes a ResourceManager that becomes the authority for scheduling jobs and allocating resources among applications across a cluster, plus a NodeManager agent that oversees operations on individual compute nodes. But the ResourceManager doesn't manage the application execution process. In Hadoop 2 systems, each application is controlled by its own ApplicationMaster, which assesses resource requirements, requests the necessary level of resources and works with the node agents to launch jobs and track their progress.
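The division of labor described above can be sketched as a toy model. This is not the YARN API: the class and method names simply mirror the component names for readability, memory is the only resource tracked, and the node names, task names and sizes are hypothetical.

```python
class NodeManager:
    """Per-node agent: launches and records containers on its own node."""

    def __init__(self, name):
        self.name = name
        self.running = []  # (app_id, task) pairs

    def launch(self, app_id, task):
        self.running.append((app_id, task))


class ResourceManager:
    """Cluster-wide authority for resources; knows nothing about job logic."""

    def __init__(self, capacity_mb):
        self.free_mb = dict(capacity_mb)  # node name -> free memory in MB

    def allocate(self, memory_mb, count):
        """Grant up to `count` containers; return the node name for each."""
        granted = []
        for node in self.free_mb:
            while len(granted) < count and self.free_mb[node] >= memory_mb:
                self.free_mb[node] -= memory_mb
                granted.append(node)
        return granted


class ApplicationMaster:
    """One per application: requests resources and drives its own tasks."""

    def __init__(self, app_id, tasks, memory_mb):
        self.app_id = app_id
        self.tasks = tasks
        self.memory_mb = memory_mb

    def run(self, rm, node_managers):
        granted = rm.allocate(self.memory_mb, len(self.tasks))
        for task, node in zip(self.tasks, granted):
            node_managers[node].launch(self.app_id, task)
        return len(granted)


nms = {name: NodeManager(name) for name in ("node1", "node2")}
rm = ResourceManager({"node1": 4096, "node2": 4096})

batch = ApplicationMaster("batch", ["map-0", "map-1", "map-2"], memory_mb=2048)
query = ApplicationMaster("query", ["scan-0", "scan-1"], memory_mb=1024)

batch_started = batch.run(rm, nms)
query_started = query.run(rm, nms)
print(batch_started, query_started, nms["node2"].running)
```

In this run, the batch application and the query application end up sharing node2, each under the control of its own ApplicationMaster, while the ResourceManager only tracks how much memory remains on each node.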
Analytics advances in big data environment
Those changes will have some positive effects on the Hadoop framework's ability to support real-time analytics and ad hoc querying. First, segregating resource management from application management and processing reduces the internal overhead of what had been the JobTracker's combined role and enables the ResourceManager to be more efficient and effective in allocating a cluster's inventory of CPU, disk, and memory resources to applications.
But the segregation won't lead only to more balanced workloads across cluster nodes; it also makes it possible for users to simultaneously run MapReduce and non-MapReduce applications on top of YARN. In addition to MapReduce batch jobs, that could include applications for event stream processing, NoSQL databases, interactive querying, and graph processing and analysis.
Also, allowing multiple types of applications to run at the same time in relative isolation begins to address an issue that is sometimes overlooked with open source technologies -- data protection and system security. Placing the oversight and monitoring of each application's jobs in its own ApplicationMaster helps keep faulty or malicious code that gets into one application from affecting others, affording a greater degree of processing safety in a big data environment.
The improvements provided by YARN reflect a recognition of the need for "hardening" Hadoop and transitioning it toward a more general operating system model. They also greatly increase Hadoop's analytical flexibility: With Hadoop 2, real-time analytics, batch analysis and interactive data management for reporting and querying can all find a place at the big data table.
About the author:
David Loshin is president of Knowledge Integrity Inc., a consulting, training and development services company that works with clients on big data, business intelligence and data management projects. He also is the author of numerous books, including Big Data Analytics. Email him at firstname.lastname@example.org.