SAN DIEGO -- By now, everyone’s heard of Apache Hadoop. Created by Doug Cutting during his tenure at Yahoo and named after his son’s stuffed elephant, Hadoop is a library of open source software used to create a distributed computing environment. Today, it is touted as one of the newer -- and perhaps one of the best -- technologies designed to extract value out of “big data.”
But as Hadoop becomes a household name, it is also taking on a certain mythological form. Philip Russom, industry analyst and director of research for The Data Warehousing Institute (TDWI) in Renton, Wash., wants to bust that thinking wide open. At last week’s TDWI Solution Summit titled “Big Data Analytics for Real-Time Business Advantage,” Russom presented 12 facts about Hadoop in the hopes of dispelling some of the common myths circulating throughout the industry.
What is Hadoop?
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Fact 1: Hadoop consists of multiple products. People may talk about Hadoop as if it’s this enormous, singular thing, but it’s actually made up of multiple products, Russom said.
“Hadoop is the brand name of a family of open source products,” Russom said. “Those products are incubated and administered by Apache software.”
When people think of Hadoop, they typically think of its Hadoop Distributed File System, or HDFS, which Russom calls a foundation on which other products -- like MapReduce -- can be layered.
Fact 2: Apache Hadoop is open source but it’s available from proprietary vendors, too. While the software is open source and can be downloaded for free, vendors like IBM, Cloudera and EMC Greenplum have also made Hadoop available through their own distributions, Russom said.
Those distributions tend to come with added features such as administrative tools not offered by Apache Hadoop as well as support and maintenance. Some may scoff at that: Why pay for support when the open source community is free? But Russom said the distributions are making HDFS more powerful for businesses with established IT departments.
Fact 3: Hadoop is an ecosystem, not a single product. The products, which help extend the technology, are being developed by the open source market as well as vendors. Specifically, Russom notes that vendors are providing new products to help make Hadoop look more relational and structured.
“We have this long history of having reporting platforms or data integration platforms and providing interfaces to the newest platforms,” Russom said. “We’re seeing a similar thing right now with Hadoop.”
Fact 4: HDFS is a file system, not a database management system. Calling it a database is a lapse in semantics that is one of Russom’s biggest pet peeves. While HDFS can manage collections of data, certain database management system attributes are absent in Hadoop.
“Like the ability to randomly access data thanks to query indexes,” he said. “We expect structure, which is typically missing from the kind of data types Hadoop deals with.”
Fact 5: Hive resembles SQL, but it’s not standard SQL. Russom said that fact can be a little unnerving because businesses -- and the tools they use to access data -- tend to be SQL-based. Instead, Hadoop uses Apache Hive and HiveQL, a SQL-like language.
“I’ve heard people say, ‘It’s so easy to learn Hive. Just learn Hive,’ ” Russom said. “But that doesn’t solve the real problem of compatibility with SQL-based tools.”
Russom believes the compatibility issue is a short-term problem, but one that acts as a barrier to mainstreaming Hadoop.
Fact 6: Hadoop and MapReduce are related, but they don’t require each other. MapReduce was developed by Google before HDFS existed, Russom said. Plus, he added, some vendors such as MapR are peddling variations of MapReduce that do not need HDFS.
Russom, though, considers the duo a good combination. Most of the value in HDFS, he said, lies in the tools that can be layered over the distributed file system.
Fact 7: MapReduce provides control for analytics, not analytics per se. MapReduce is a general-purpose execution engine, Russom said. It’s conducive to big data analytics because it can take hand-written code, automatically run it in parallel across the data and then reduce the results into a single set. But MapReduce doesn’t actually do the analytics itself.
“This is basic MPP [massively parallel processing] architecture but generalized so that you can throw any code at it imaginable and it just has this talent of making it parallel,” Russom said. “That’s very powerful.”
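To make that division of labor concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and a local thread pool stands in for the nodes of a cluster. The point is only that the developer supplies the map and reduce logic by hand, while the engine handles the parallel execution and the grouping of results.

```python
from collections import defaultdict
from multiprocessing.dummy import Pool  # thread pool; a stand-in for cluster nodes


def map_phase(line):
    """Hand-written map logic: emit a (word, 1) pair for each word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, the step an engine performs between map and reduce."""
    groups = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Hand-written reduce logic: collapse all values for one key into one result."""
    return key, sum(values)


def run_job(lines, workers=4):
    """Run the map step in parallel, then group and reduce into a single result set."""
    with Pool(workers) as pool:
        mapped = pool.map(map_phase, lines)
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    sample = ["sensor hot", "sensor ok", "sensor hot"]
    print(run_job(sample))
```

Nothing in run_job knows what is being counted. Swapping in different map and reduce functions changes the analysis without touching the parallel plumbing, which is the sense in which the engine provides control for analytics rather than the analytics themselves.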
Fact 8: Hadoop is about data diversity, not just data volume. Some have pigeonholed Hadoop as technology designed for high volumes of data, but Hadoop’s real value is in the way it can handle diverse data, Russom said.
“That can include the stuff most of our data warehouses were not designed to handle,” he said. “Things like semi-structured and fully nonstructured data.”
Fact 9: Hadoop complements a data warehouse; it’s rarely a replacement. Hadoop’s ability to manage diverse data types has prompted claims that data warehouses are dying, but Russom cautions against such sweeping statements.
“How often do people replace things in IT?” he asked. “Almost never.”
Data warehouses still do the work they were built to do well, he said, and Hadoop will complement the data warehouse by becoming “an edge system.”
“We see data warehouses and architecture getting more and more distributed with more and more pieces added to it,” he said.
Fact 10: Hadoop enables many types of analytics, not just Web analytics. Hadoop is sometimes seen as technology for Internet giants, which raises the question of whether it will go mainstream. Russom believes it will, partly because it can handle broader analytics.
Railroad companies, for example, are using sensors to detect unusually high temperatures on rail cars, which can signal an impending failure, said Russom, who also cited examples from the robotics and retail industries.
Although he sees a promising future for Hadoop, Russom said mainstream adoption will take years.
Fact 11: Big data does not require Hadoop. The two have become synonymous, but Russom said Hadoop isn’t the only answer. Specifically, he mentioned products from Teradata, Sybase IQ (now owned by SAP) and Vertica (now owned by Hewlett-Packard).
Plus, some companies have been working with big data longer than Hadoop has been in existence -- for example, the telecommunications industry with its call detail records, Russom said.
Fact 12: Hadoop is not free. While the software is open source, deploying Hadoop has a cost. Russom said the lack of features such as administrative tools and support can create additional costs. Hadoop also lacks an optimizer and requires professionals -- who make upward of $200,000 -- to hand-code within the environment.
That doesn’t include the hardware costs of a Hadoop cluster or the real estate and the power it takes to make that cluster operational.
“Don’t go thinking Hadoop is free or even cheap,” he said. “There are a lot of costs that go with it.”