Managing Hadoop projects: What you need to know to succeed
With all the buzz that Hadoop is generating in IT circles these days, it's easy to start thinking that the open source distributed processing framework can handle just about anything in big data environments. But real-time analysis involving ad hoc querying of Hadoop data has been a notable exception.
Hadoop is optimized to crunch through large sets of structured, unstructured and semi-structured data, but it was designed as a batch processing system -- an approach that doesn't lend itself to fast analytical performance.
And Jan Gelin, vice president of technical operations at Rubicon Project, said analytics speed is something that the online advertising broker needs -- badly.
Rubicon Project is based in Playa Vista, Calif., and offers a platform that advertisers use to bid for ad space on webpages as Internet users visit them. The system lets advertisers see information about website visitors before making bids, to help ensure that ads are seen only by interested consumers. Gelin said the process involves a lot of analytics, and it all has to happen in fractions of a second.
Rubicon leans heavily on Hadoop to help power the ad-bidding platform. But the key, Gelin said, is to pair Hadoop with other technologies that can handle true real-time analytics. Rubicon uses the Storm complex event processing engine to capture and quickly analyze large amounts of data as part of the ad bidding process. Storm then sends the data into a cluster running MapR Technologies Inc.'s Hadoop distribution. The Hadoop cluster is primarily used to transform the data to prepare it for more traditional analytical applications, such as business intelligence reporting. Even for that stage, though, much of the information is loaded into a Greenplum analytical database after the transformation process is completed.
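The division of labor Gelin describes -- a stream layer answering in real time while a batch layer transforms data behind it -- can be sketched in miniature. The toy Python below is purely illustrative (the class and function names are invented for this sketch, and it stands in for Storm and Hadoop rather than using either): a stream stage updates per-advertiser counts the instant bid events arrive, while the raw events accumulate into batches for slower downstream transformation.

```python
from collections import defaultdict

class StreamStage:
    """Toy stand-in for a Storm-style stream processor: updates
    per-advertiser counts as each bid event arrives, and hands off
    raw events in batches for downstream batch processing."""

    def __init__(self, batch_size=3):
        self.counts = defaultdict(int)   # real-time view, updated per event
        self.buffer = []                 # raw events awaiting batch handoff
        self.batch_size = batch_size
        self.batches = []                # batches "sent" to the batch layer

    def on_event(self, advertiser, bid_cents):
        self.counts[advertiser] += 1     # queryable immediately
        self.buffer.append((advertiser, bid_cents))
        if len(self.buffer) >= self.batch_size:
            self.batches.append(self.buffer)  # flush to the batch layer
            self.buffer = []

def batch_transform(batch):
    """Toy stand-in for the batch layer: total spend per advertiser."""
    totals = defaultdict(int)
    for advertiser, bid_cents in batch:
        totals[advertiser] += bid_cents
    return dict(totals)

stage = StreamStage(batch_size=3)
for event in [("acme", 50), ("acme", 70), ("bolt", 20), ("bolt", 90)]:
    stage.on_event(*event)

print(stage.counts["acme"])               # real-time count, available at once
print(batch_transform(stage.batches[0]))  # batch result lags the stream
```

The real systems differ enormously in scale and fault tolerance, but the shape is the same: questions that need sub-second answers are served from stream-side state, while the batch side periodically reprocesses the accumulated event log.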
Gelin said the sheer volume of data that Rubicon produces on a daily basis pointed it toward Hadoop's processing muscle. But when it comes to analyzing the data, he added, "You can't take away the fact that Hadoop is a batch-processing system. There are other things on top of Hadoop you can play around with that are actually like real real-time."
Several Hadoop vendors are trying to eliminate the real-time analytics restrictions. Cloudera Inc. got the ball rolling in April by releasing its Impala query engine, promising the ability to run interactive SQL queries against Hadoop data with near-real-time performance. Pivotal, a data management and analytics spinoff from EMC Corp. and its subsidiary VMware, followed three months later with a similar query engine named Hawq. Also looking to get in the game is Splunk Inc., which focuses on capturing streams of machine-generated data; it made a Hadoop data analysis tool called Hunk generally available in late October.
The Hadoop 2 version of the framework, which was released in October as well, also aids the cause by opening up Hadoop systems to applications other than MapReduce batch jobs. With all the new tools and capabilities available or on the way, Hadoop may soon be up to the real-time analysis challenge, said Mike Gualtieri, an analyst at Forrester Research Inc. One big factor working in its favor, he added, is that vendors as well as Hadoop users are determined to make the technology function in real or near real time for analytics applications.
"Hadoop is fundamentally a batch operation environment," Gualtieri said. "However, because of the distributed architecture and because a lot of use cases have to do with putting data into Hadoop, a lot of vendors or even the end users are saying, 'Hey, why can't we do more real-time or ad hoc queries against Hadoop,' and it's a good question."
Real-time analysis roadblocks
Gualtieri sees two main real-time hurdles for Hadoop. First, he said, most of the new Hadoop query engines still aren't as fast as queries run against mainstream relational databases. Tools like Impala and Hawq provide interfaces that let end users write queries in SQL. The queries then get translated into MapReduce jobs for execution on a Hadoop cluster, a process that is inherently slower than running a SQL query directly against a relational database, according to Gualtieri.
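The translation overhead Gualtieri describes is easier to see with a concrete, if toy-sized, comparison. A relational database can evaluate `SELECT page, COUNT(*) FROM visits GROUP BY page` in one pass; a MapReduce translation of the same query has to run a map phase, a shuffle and a reduce phase. The Python sketch below mimics those three stages in-process -- it illustrates the execution model and is not actual Hadoop code.

```python
from itertools import groupby
from operator import itemgetter

# Rows a SQL engine would scan and aggregate directly:
#   SELECT page, COUNT(*) FROM visits GROUP BY page;
visits = ["home", "pricing", "home", "docs", "home", "pricing"]

# --- Map phase: emit a (key, 1) pair per record ---
mapped = [(page, 1) for page in visits]

# --- Shuffle phase: sort and group intermediate pairs by key ---
# On a real cluster this step moves data between nodes over the
# network, a large part of why translated queries run slower.
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

# --- Reduce phase: sum the counts for each key ---
counts = {page: sum(v for _, v in pairs) for page, pairs in shuffled}

print(counts)  # {'docs': 1, 'home': 3, 'pricing': 2}
```

The shuffle is the expensive step at scale: intermediate pairs must travel between map and reduce nodes, which is one reason a translated query can't match a database that scans and aggregates in a single pass.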
The second challenge Gualtieri sees is that Hadoop currently is a read-only system once data has been written into the Hadoop Distributed File System (HDFS). Users can't easily insert, delete or modify individual pieces of data stored in the file system like they can in a relational database, he said.
While the challenges are real, Gualtieri thinks they can be overcome. For example, Hadoop 2 includes a capability for appending data to HDFS files.
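The distinction -- appending is allowed, in-place updates are not -- can be mimicked with an ordinary local file standing in for an HDFS file. This is a plain-Python analogy rather than the HDFS API: tacking a record onto the end is cheap, but changing an existing record means reading everything back and rewriting the file wholesale, which is roughly the update model Gualtieri is pointing at.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.log")

# Create, then append -- the two write operations the analogy allows.
with open(path, "w") as f:
    f.write("click,page=home\n")
with open(path, "a") as f:            # append: supported
    f.write("click,page=docs\n")

# There is no "update record 1 in place" operation; modifying a row
# means rewriting the whole file -- the relational-style UPDATE that
# an append-only store lacks.
with open(path) as f:
    rows = f.read().splitlines()
rows[0] = "click,page=pricing"        # edit in memory
with open(path, "w") as f:            # full rewrite, not an in-place edit
    f.write("\n".join(rows) + "\n")

with open(path) as f:
    print(f.read().splitlines())      # ['click,page=pricing', 'click,page=docs']
```

For a handful of rows the rewrite is trivial; for the multi-terabyte files Hadoop is built around, it is exactly the kind of cost that makes fine-grained updates impractical.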
Gartner Inc. analyst Nick Heudecker said via email that even though the new query engines might not support true real-time data analytics functionality, they do enable users with less technical expertise to access and analyze data stored in Hadoop. That can decrease the cycle time and cost associated with running Hadoop analytics jobs because MapReduce developers no longer need to be involved in writing queries, he said.
Organizations will have to decide for themselves whether that's enough of a justification for deploying such tools. The scalability and affordability of Hadoop are also alluring -- but that can lead some businesses down the wrong path, said Patricia Gorla, a consultant at IT services provider OpenSource Connections LLC in Charlottesville, Va. What's required, Gorla cautioned, is finding the best fit for Hadoop -- and not trying to force-fit it into a systems architecture where it doesn't belong. "Hadoop is good at what it's good at and not at what it's not," she said.