Managing Hadoop projects: What you need to know to succeed
In the past few years, Hadoop has earned a lofty reputation as the go-to big data analytics engine. To many, it's synonymous with big data technology. But the open source distributed processing framework isn't the right answer to every big data problem, and companies looking to deploy it need to carefully evaluate when to use Hadoop -- and when to turn to something else.
For example, Hadoop has ample power for processing large amounts of unstructured or semi-structured data. But it isn't known for its speed in dealing with smaller data sets. That has limited its application at Metamarkets Group Inc., a San Francisco-based provider of real-time marketing analytics services for online advertisers.
Metamarkets CEO Michael Driscoll said the company uses Hadoop for large, distributed data processing tasks where time isn't a constraint. That includes running end-of-the-day reports to review daily transactions or scanning historical data dating back several months.
But when it comes to running the real-time analytics processes that are at the heart of what Metamarkets offers to its clients, Hadoop isn't involved. Driscoll said that's because it's optimized to run batch jobs that look at every file in a database. It comes down to a tradeoff: In order to make deep connections between data points, the technology sacrifices speed. "Using Hadoop is like having a pen pal," he said. "You write a letter and send it and get a response back. But it's very different than [instant messaging] or email."
Because of the time factor, Hadoop has limited value in online environments where fast performance is crucial, said Kelly Stirman, director of product marketing at 10gen Inc., developer of the MongoDB NoSQL database. For example, analytics-fueled online applications, such as product recommendation engines, rely on processing small amounts of information quickly. But Hadoop can't do that efficiently, according to Stirman.
No database replacement plan
Some businesses might be tempted to scrap their traditional data warehouses in favor of Hadoop clusters because technology costs are so much lower with the open source framework. But Carl Olofson, an analyst at market research company IDC, said that is an apples-and-oranges comparison.
Olofson said the relational databases that power most data warehouses are designed to accommodate trickles of data that come in at a steady rate over a period of time, such as transaction records from day-to-day business processes. On the other hand, he added, Hadoop is best suited to processing vast stores of accumulated data.
And because Hadoop is typically used in large-scale projects that require clusters of servers and employees with specialized programming and data management skills, implementations can become expensive, even though the cost per unit of data may be lower than with relational databases. "When you start adding up all the costs involved, it's not as cheap as it seems," Olofson said.
Specialized development skills are needed because Hadoop uses the MapReduce software programming framework, with which relatively few developers are familiar. That can make it difficult to access data in Hadoop from SQL databases, according to Todd Goldman, vice president of enterprise data integration at software vendor Informatica Corp.
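To give a sense of why MapReduce can feel unfamiliar to developers used to SQL, here is a minimal sketch of the programming model in plain Python. It is not Hadoop's API and runs on a single machine; the sample records and the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical and chosen only for illustration. The job computes roughly what a SQL developer would write as SELECT region, SUM(amount) ... GROUP BY region.

```python
from collections import defaultdict

# Hypothetical sample data; in a real Hadoop job the input would be
# files spread across the cluster, not an in-memory list.
RECORDS = [
    ("west", 120),
    ("east", 75),
    ("west", 30),
    ("north", 50),
    ("east", 25),
]


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    region, amount = record
    yield region, amount


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values that share a key into one result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on a single machine."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(RECORDS))
```

Even in this toy form, the developer has to spell out how records become key-value pairs and how each group is collapsed, which is work a SQL engine does behind a single declarative statement.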
Various vendors have developed connector software that can help move data between Hadoop systems and relational databases. But Goldman thinks that for many organizations, too much work is needed to accommodate the open source technology. "It doesn't make sense to revamp your entire corporate data structure just for Hadoop," he said.
Helpful, not hype-full
One good use for Hadoop that Goldman cited is as a staging area and data integration platform for running extract, transform and load (ETL) functions. That may not be as exciting an application as all the hype over Hadoop seems to warrant, but Goldman said it particularly makes sense when an IT department needs to merge large files. In such cases, the processing power of Hadoop can come in handy.
Driscoll said Hadoop is good at handling ETL processes because it can split up the integration tasks among numerous servers in a cluster. He added that using Hadoop to integrate data and stage it for loading into a data warehouse or other database could help justify investments in the technology, giving it a foot in the door for larger projects that take more advantage of Hadoop's scalability.
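The idea of splitting an integration task among servers can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how Hadoop itself is implemented: threads stand in for the servers in a cluster, the two input "files" are small in-memory lists, and every name here (normalize, partition, merge_partition, run_etl) is hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical extracts from two source systems: (customer name, amount).
FILE_A = [("Acme ", 10), ("globex", 5)]
FILE_B = [("acme", 7), ("Initech", 3)]


def normalize(key):
    """Transform step: clean up a customer name so duplicates match."""
    return key.strip().lower()


def partition(records, num_workers):
    """Assign each record to a worker using a stable hash of its cleaned key."""
    parts = [[] for _ in range(num_workers)]
    for key, value in records:
        clean = normalize(key)
        index = zlib.crc32(clean.encode("utf-8")) % num_workers
        parts[index].append((clean, value))
    return parts


def merge_partition(part):
    """Work done independently by one worker: total the amounts per customer."""
    totals = {}
    for key, value in part:
        totals[key] = totals.get(key, 0) + value
    return totals


def run_etl(file_a, file_b, num_workers=3):
    """Merge two extracts in parallel and return rows staged for loading."""
    parts = partition(file_a + file_b, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(merge_partition, parts))
    staged = {}
    for partial in results:
        # A given key always hashes to one partition, so results never overlap.
        staged.update(partial)
    return staged


if __name__ == "__main__":
    print(run_etl(FILE_A, FILE_B))
```

Because every record for a given customer lands in the same partition, each worker can finish its share without coordinating with the others, which is what lets this kind of job scale out as more machines are added.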
Of course, leading-edge Internet companies such as Google, Yahoo, Facebook and Amazon.com have been big Hadoop users for years. And new technologies aimed at eliminating some of Hadoop's limitations are becoming available. For example, several vendors have released tools designed to enable real-time analysis of Hadoop data. And a Hadoop 2.0 release that is in the works will make MapReduce an optional element and enable Hadoop systems to run other types of applications.
Ultimately, it's important for IT and business executives to cut through all the hype and understand for themselves where Hadoop could fit in their operations. Stirman said there's no doubt it's a powerful tool that can support many useful analytical functions. But it's still taking shape as a technology, he added.
"There's so much hype around it now that people think it does pretty much anything," Stirman said. "The reality is that it's a very complex piece of technology that is still raw and needs a lot of care and handling to make it do something worthwhile and valuable."