Edo Interactive ran into a big problem several years ago: There weren't enough hours in the day for its data warehouse system to process all of the credit and debit transaction data the company uses to recommend personalized promotional offers to retailers and restaurants.
"We were taking 27 hours to process our daily build, when it worked," said Tim Garnto, Edo's senior vice president of infrastructure and information systems. So in 2013, Edo replaced the existing system, based on a PostgreSQL relational database, with a Hadoop cluster that has become a data lake architecture for the organization.
Garnto's team pulls data on more than 50 million U.S. retail transactions a day into the 20-node cluster, which runs on Cloudera's Hadoop distribution and is fed using data integration tools from Pentaho. The data, collected from banks and credit card companies, is processed and then run through predictive models designed to pinpoint individual cardholders for coupons. The coupons are promoted in weekly emails sent by Edo's business partners and automatically applied when purchases are made.
The daily data build is down to about four hours, and Garnto said Edo's data analysts can do their work "in minutes or hours," depending on the complexity of the models they're running. "Before," he added, "they were just kind of dead in the water."
But it hasn't been all sunshine and easy sailing on the data lake, according to Garnto -- a sentiment echoed by other IT managers who have led implementations of large Hadoop systems. Initially, only one IT staffer at Edo had experience with Hadoop and the MapReduce programming framework. The company, which has joint headquarters in Chicago and Nashville, Tenn., invested in training for other workers to build up Hadoop skills internally, but then it had to wean them off writing data queries in the more familiar relational way. "We spent a lot of time updating that process," Garnto said.
Creating a two-step routine for making the incoming raw data consistent and generating standardized analytics data sets also took time to figure out. And the cluster, which currently holds a total of 45 billion records amounting to 255 terabytes (TB) of data, has become so central to Edo's business operations that Garnto needs to tread carefully in managing it and adding new Hadoop ecosystem technologies; otherwise, an adjustment made for one part of the company could affect how the system works for others. "Of all the challenges we've faced, that's going to be the most interesting," he said, adding that there may be a need for a steering committee to help oversee the development roadmap for the cluster.
Data lake enables instant analysis
Webtrends Inc., which collects and processes activity data from websites, mobile phones and the Internet of Things, is another data lake user. The Portland, Ore., company deployed a Hortonworks-based Hadoop cluster with a soft launch in July 2014 and went fully live with it at the start of 2015 -- initially to support a product called Explore that lets corporate marketers do ad hoc analysis of customer data. Peter Crossley, director of product architecture at Webtrends, said about 500 TB of data is being added each quarter to the 60-node cluster, which is up to 1.28 petabytes in total now.
Over time, Webtrends plans to use the Hadoop platform as a replacement for a homegrown system that stores data in flat files on network-attached storage devices. Using the Apache Kafka message queuing technology and automated processing scripts, Internet clickstream data can be streamed into the cluster and prepared for analysis in just 20 to 40 milliseconds, Crossley said. As a result, the reporting and analytics process can start "almost instantaneously" -- much faster than with the older system. The Hadoop cluster also supports more advanced analytics, and hardware costs are 25% to 50% lower on it.
Crossley said, though, that adopting the data lake concept required an internal "mind-set change" on managing and using the information that Webtrends collects for its clients. Before, the company primarily built general-purpose reports from the broad array of data it warehoused. But, he said, a data lake "is less about a single source of truth [in and of itself], and more that this is a single source of truth you can build multiple data sets on top of," for different analytics uses.
Webtrends also had to think hard about its data lake architecture and data governance processes to keep the Hadoop cluster from becoming "a data marsh," as Crossley put it. The raw data going into the system is loosely structured, but he added there are "very strict" rules on what it should look like. In addition, his team has partitioned the cluster into three separate tiers: one for raw data, a second for augmented daily data sets and another for third-party information that gets pulled in. Each tier has its own data classifications and governance policies, based on the particulars of the different data sets.
Don't lose control of your data
Suren Nathan, CTO at Razorsight Corp. in Reston, Va., also pointed to the need to be "very disciplined and organized" in setting up and managing a Hadoop data lake. If not, Nathan said, the system can quickly turn into an out-of-control dumping ground -- "like a SharePoint portal with all these documents that nobody knows how to find."
Razorsight, which offers a set of cloud-based analytics services for telecommunications companies, began using a cluster that runs the Hadoop distribution from MapR Technologies in the second quarter of 2014. Sets of customer, operations and network data from clients are pulled into the system via a homegrown ingestion tool and run through the Spark processing engine to prepare them for analysis by Razorsight's data scientists; the cluster has five production nodes and a 120-TB storage capacity.
Like Webtrends, Razorsight has split its data lake into three partitions. In Razorsight's case, one partition holds data that's less than six months old, another contains older but still active data, and the third is an archive for information that's no longer used but needs to be retained. At the moment, there's a little over 20 TB of data in the two active zones, according to Nathan. To help make the system work smoothly, he added, Razorsight brought in new workers with experience in data governance and development of distributed systems, while also retraining existing IT staffers on using Hadoop, Spark and related technologies.
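As a rough illustration of the age-based split Nathan describes, the routing rule could be sketched as follows. This is not Razorsight's code: the function name, the zone labels and the choice to approximate six months as 183 days are all assumptions made for the example.

```python
from datetime import date, timedelta

# Assumption: "six months" is approximated as 183 days.
RECENT_WINDOW = timedelta(days=183)


def assign_zone(record_date, today, still_active):
    """Pick one of three hypothetical zone labels for a data set.

    record_date  -- date the data was created
    today        -- date the rule is evaluated on
    still_active -- False if the data is no longer used but must be retained
    """
    if not still_active:
        # Retained-only data goes to the archive, whatever its age.
        return "archive"
    if today - record_date < RECENT_WINDOW:
        # Active data less than six months old.
        return "recent"
    # Older data that is still in active use.
    return "older_active"


if __name__ == "__main__":
    today = date(2015, 6, 1)
    print(assign_zone(date(2015, 5, 1), today, True))
    print(assign_zone(date(2014, 1, 1), today, True))
    print(assign_zone(date(2014, 1, 1), today, False))
```

Keeping the rule in one small function means the age threshold can be changed in a single place if retention requirements shift.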
Razorsight is also moving to the new platform in stages. At about $2,000 per TB, the Hadoop cluster costs one-tenth as much as the IBM Netezza data warehouse system the company had in place previously. Even so, Nathan said Razorsight first set up the cluster solely for data storage, then moved the processing and preparation stage there as well. Analytical modeling and data visualization are still done on the old system, partly because of ties between the Netezza hardware and IBM's SPSS analytics software. The modeling will stay put for now, but Nathan expects to move the visualization layer and Razorsight's repository of analytical results into the data lake architecture by the end of this year.