Managing Hadoop projects: What you need to know to succeed
Apache Hadoop technology has so frequently been linked to the concept of big data that the two often appear in lockstep at conferences and industry briefings and in media reports. But there is a growing awareness among IT professionals, consultants and industry analysts that Hadoop is just one piece of the big data puzzle.
The emergence of Hadoop -- an open-source technology that allows companies to store massive amounts of information in a distributed computing environment for analytical purposes -- has certainly gone a long way toward launching the topic of big data into the limelight. But IT industry insiders are also quick to point out that Hadoop is not without its headaches.
"[People] are starting to realize big data and Hadoop are not synonymous," said Brian Hopkins, an enterprise architecture analyst with Cambridge, Mass.-based Forrester Research Inc. "That just because they've downloaded Hadoop, it doesn’t mean big data. It just means they've downloaded Hadoop."
Big data and Hadoop: From buzz to business
Hadoop grew out of work inspired by Google's published research on distributed storage and MapReduce, was developed extensively at Yahoo and is now a project of the Apache Software Foundation. The technology and its iconic yellow elephant logo skyrocketed into popularity in 2011, after it earned a reputation as a "must have" big data accessory and several case studies began to emerge.
Take eBay Inc., for example. The San Jose, Calif.-based company was spotted at several conferences during the year describing its three-tiered data analytics platform. Structured data resides in the first tier -- an enterprise data warehouse that is used for daily housekeeping items, such as feeding business intelligence dashboards and reports. The second tier consists of a Teradata data management platform that is used to store huge amounts of semi-structured information. Fully unstructured data such as textual information lives in the third tier -- a Hadoop cluster reserved for deeper research, analysis and experimentation.
"It's an interesting emerging use case in which Hadoop is being thought of as a staging environment on steroids -- a place to stage and dump a massive amount of stuff," said Hopkins in the recent podcast "Big data is value at extreme scale." "You're not sure what you want to do with it, so stream it as flat files into Hadoop and let Hadoop deal with it."
Hadoop is built on a distributed file system that stores files as they arrive, without requiring a schema up front. That makes it easier to ingest data -- structured, semi-structured and unstructured alike -- than it is with a relational database. As a result, Hadoop has become a haven for businesses that want to collect huge amounts of data -- such as freeform text from social media sites, computer-generated sensor logs and GPS-based location information -- without bogging down their traditional relational databases.
"Hadoop is a load-and-go environment: Administrators can dump the data into Hadoop without having to convert it into a particular structure," wrote Wayne Eckerson, research director for TechTarget's business applications and architecture media group, in his recent report "Big data and its impact on data warehousing." "Then users (or data scientists) can analyze the data using whatever tools they want."
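The load-and-go idea Eckerson describes can be illustrated with a minimal Python sketch. It does not use Hadoop at all; the sample records, the `raw_store` list and both reader functions are hypothetical and exist only to show data being stored untouched and given a structure later, at read time.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no shared schema).
raw_store = [
    '{"user": "a17", "text": "love the new listing page"}',
    'sensor-42,2012-06-01T10:00:00,71.3',
    '{"user": "b08", "lat": 47.6, "lon": -122.3}',
]


def read_as_json(lines):
    """Interpret lines as JSON objects at read time; skip lines that do not fit."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def read_as_sensor_csv(lines):
    """Interpret lines as 'sensor,timestamp,value' rows; skip everything else."""
    for line in lines:
        parts = line.split(",")
        if len(parts) == 3 and parts[0].startswith("sensor-"):
            yield {"sensor": parts[0], "time": parts[1], "value": float(parts[2])}


# Two different "views" over the same untouched raw data.
json_records = list(read_as_json(raw_store))
sensor_records = list(read_as_sensor_csv(raw_store))
```

Nothing is converted when the data is loaded. Each analysis decides what structure to apply, which is the opposite of the up-front modeling a data warehouse typically requires.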
The fact that Hadoop enables users to explore data in the raw could represent a paradigm shift for data warehousing practitioners, Jill Dyché, vice president of thought leadership at SAS Institute Inc. and a long-time industry analyst, said at the recent 11th annual Pacific Northwest BI Summit.
"In the data warehousing world, we encourage business requirements, we encourage rigor around data quality and we discourage just loading data for its own sake," she said. "But in the big data world, why not? Why not just dump it in there and figure out what else you can do?"
Apache Hadoop hardships
There are additional advantages to Hadoop. MapReduce, for example, one of the tools in the Apache Hadoop library, processes large data sets in parallel across the machines in a cluster. It's a general purpose execution engine that can handle even hand-written code, according to Philip Russom, industry analyst and director of research for The Data Warehousing Institute (TDWI).
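For readers unfamiliar with the MapReduce pattern, the following is a rough, single-process Python sketch of a word count. It is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are made up for illustration, and a real cluster would spread the map and reduce steps across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle and reduce in sequence, all in one process."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


word_counts = run_job(["big data big cluster", "data in the raw"])
```

Because each document is mapped independently and each key is reduced independently, both steps can in principle run in parallel, which is what makes the pattern suitable for very large data sets.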
But to use MapReduce, programmers must be able to speak its language. And tools such as Hive, which uses a SQL-like language to access data, don't always make the task any easier.
"I've heard people say, 'It's so easy to learn Hive. Just learn Hive,'" said Russom, while delivering a presentation at a big data seminar last June in San Diego. "But that doesn't solve the real problem of compatibility with [traditional] SQL-based tools."
Businesses interested in analyzing the data they've collected will also require skilled workers -- like data scientists -- who are capable of operating tools built specifically for Hadoop. Data scientists typically hold doctorates and, as a result, can be expensive to hire and hard to find.
Then there are additional drawbacks to the technology: Eckerson describes Apache Hadoop as "wet behind the ears" and lacking in security, data quality and metadata catalog capabilities, among other things. Hopkins calls it "hard to use" and "immature." Russom sees a promising future for the technology but believes mainstream adoption will take several years.
Even at eBay, the company's real competitive differentiator is not Hadoop but the technology that powers the second tier of its data analytics platform, according to Tom Fastner, a senior member of the technical staff and an eBay architect. EBay has dubbed its Teradata-based database system Singularity and says it offers more than 30 petabytes of space and supports lower concurrency than its enterprise data warehouse. The biggest use case for Singularity is user behavior analysis, a process that often leads to valuable business insights, said Fastner, who spoke at a TDWI big data seminar in San Diego last June.
Plus, Forrester's Hopkins said, there are other technologies that may enable businesses to cross into big data territory better than Hadoop can. It all depends on the needs of the business.
"We look at the big data technology stack across two different dimensions," Hopkins said. "One dimension is structure [and] the other dimension is latency."
Each dimension runs from low to high, and big data tools and technologies are placed along the two spectrums. For example, in-memory technology, such as SAP HANA, can provide low-latency (real-time) results for highly structured data, whereas massively parallel processing (MPP) technologies, such as those from Teradata or IBM Netezza, can handle highly structured data but with high latency.
Hadoop, Hopkins said, may be able to handle lightly structured data, such as text, but because it processes data in batches, it cannot provide a real-time environment.
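The two dimensions Hopkins describes can be written down as a small lookup, sketched here in Python. The `Placement` class, the `candidates` function and the coarse "high"/"low" labels are assumptions made for illustration; the placements themselves are one reading of the examples above, not a Forrester model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    technology: str
    structure: str  # "high" or "low": how structured the data is
    latency: str    # "high" or "low": how long results take


# Placements as read from the examples in the article (coarse, illustrative).
placements = [
    Placement("in-memory (e.g., SAP HANA)", "high", "low"),
    Placement("MPP (e.g., Teradata, IBM Netezza)", "high", "high"),
    Placement("Hadoop", "low", "high"),
]


def candidates(structure, latency):
    """Return the technologies that sit at a given point on the two dimensions."""
    return [
        p.technology
        for p in placements
        if p.structure == structure and p.latency == latency
    ]


batch_text_options = candidates("low", "high")
realtime_text_options = candidates("low", "low")
```

In this reading, a business that needs batch analysis of lightly structured data lands on Hadoop, while the lightly structured, low-latency corner is left empty by the examples given.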
Hadoop may eventually step out of the spotlight and become just another face in the IT crowd, say experts. But right now, it's still a niche technology and its biggest devotees are Internet giants.
Nicole Laskowski is the news editor for SearchBusinessAnalytics.com. Follow her on Twitter: @TT_Nicole.