Altitude Digital, a Denver-based online advertising platform developer, has a lot of big data processing muscle already. But now it's looking to bulk up even more by deploying the Apache Spark processing engine to add data streaming capabilities to a Hadoop cluster that handles more than 14 TB of transaction data daily -- primarily tracking how users of media websites interact with video ads.
The Spark implementation is scheduled to go live in mid-April, CTO Manny Puentes said. To get the Hadoop system ready to support the stream processing uses enabled by the engine's Spark Streaming module, the company is also expanding the number of compute nodes in its cluster from 30 to 50.
Until now, Altitude Digital has been relying on the Hive data warehouse software, another Apache open source technology, to run queries against the data stored in the cluster, which is based on MapR Technologies Inc.'s Hadoop distribution. But the Hive jobs "are longer-running reports -- and if they fail, to rerun them with terabytes of data could take hours," Puentes said. He added that in tests, Spark Streaming has run queries between four and 20 times faster than Hive, depending on the size and complexity of the data sets being processed.
That kind of improvement can mushroom, because the company's analytics applications -- for example, using data on viewings of video ads to try to optimize the placement of new ones -- often involve running a query, waiting for the results, then refining the query based on the output and running it again. If the test results play out in real applications, the analytics team might be able to get useful answers to complex queries in less than a day, instead of the four to five days it can take now, Puentes said. "That's a huge benefit for our business."
Multiple uses in mind for streaming data
Altitude Digital is also looking to tie together streams from different data sources for analysis, and to use data streaming to fuel algorithms with automated rules for understanding readers' behavior based on their browser cookies. Giving online publishers faster dashboard access to trend data is another goal: "We want to be able to return some insights into the data to the publishers in real time as well," Puentes said.
Spark doesn't hold all the answers for Altitude Digital, though. The company will process the full set of its transaction data through Spark Streaming each day, but it plans to continue using Concurrent Inc.'s open source Cascading software to run MapReduce batch reporting jobs. Spark also supports batch processing, with proponents claiming a speed boost of up to 100 times over MapReduce. But Puentes said he wants to continue taking advantage of MapReduce's fault tolerance "to ensure that the jobs get done."
Sharethrough Inc. is another online advertising company that has adopted Spark Streaming -- in its case to augment a Cloudera-based Hadoop cluster running on the Amazon Web Services cloud. Sharethrough began using Databricks Inc.'s cloud implementation of Spark in mid-2013; it currently runs 500 GB of Internet clickstream and ad-visibility data through the stream processing module daily. The Spark system powers machine learning applications that analyze the performance of the "native ads" that the San Francisco-based company embeds into news feeds for its clients.
Rob Slifka, Sharethrough's vice president of engineering, said it quickly became apparent after the Hadoop cluster was deployed two years ago that the batch-oriented system couldn't meet the company's need for more real-time analytics. Advertisers and publishers had to use data that was several hours old to make decisions on ad placements, which made it hard to be sure they were optimizing the use of their advertising budgets. Slifka said doing so can be complicated because of the nature of the ads supported by Sharethrough's platform -- they look like teasers for regular online news items, with headlines and thumbnail text that can be paired up in different combinations.
Data streaming and click-through rates
Some of those headline-text combos are more effective than others. In a test conducted by Sharethrough, for example, click-through rates on variations of an internal ad ranged from less than 1% to more than 7% -- a big difference in the online advertising world. Being able to quickly figure out which versions of an ad work best was a big impetus for using Spark Streaming, Slifka said. "If you have 10 combinations and five of them aren't performing very well, you want to know pretty quickly that they're not."
Thanks to the data streaming capabilities, Sharethrough can test various ads with website users, then quickly analyze the results to identify the ones that are resonating with readers. "We never pick a single winner [up front], ever," Slifka said. With Spark Streaming, "we can give a couple of the combinations a chance to make it up to the top."
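As a rough illustration of the kind of running comparison described here, the sketch below folds small batches of ad events into per-variant totals and flags the combinations whose click-through rate falls below a cutoff. It is plain Python rather than Spark Streaming code, and it is not Sharethrough's implementation: the variant names, the event format, the sample data and the 0.3 threshold are all hypothetical, chosen only to make the arithmetic easy to follow.

```python
from collections import defaultdict


def update_counts(counts, batch):
    """Fold one micro-batch of (variant, clicked) events into running totals."""
    for variant, clicked in batch:
        counts[variant]["impressions"] += 1
        if clicked:
            counts[variant]["clicks"] += 1
    return counts


def click_through_rates(counts):
    """Return clicks divided by impressions for every variant seen so far."""
    return {
        variant: totals["clicks"] / totals["impressions"]
        for variant, totals in counts.items()
        if totals["impressions"] > 0
    }


def underperformers(rates, threshold):
    """List the variants whose click-through rate is below the threshold."""
    return sorted(v for v, rate in rates.items() if rate < threshold)


# Hypothetical data: two headline-text variants arriving in two small batches.
counts = defaultdict(lambda: {"impressions": 0, "clicks": 0})
batches = [
    [("A", True), ("A", False), ("B", False), ("B", False)],
    [("A", False), ("A", True), ("B", False), ("B", True)],
]
for batch in batches:
    update_counts(counts, batch)

rates = click_through_rates(counts)
print(rates)
print(underperformers(rates, threshold=0.3))
```

Because the totals are updated batch by batch, the rates can be rechecked after every batch instead of waiting hours for a full report, and no variant is discarded until enough impressions have accumulated to judge it.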
Russell Cardullo, the technical lead on the Spark implementation, said stream processing makes performance monitoring more important -- and more challenging. "You need to be cognizant that this is going to be running 24 hours a day, seven days a week," Cardullo said. "The data is coming at [the system] all the time, so you need more rigor around it than just writing a job and saying you'll fix it if there are problems." He added, though, that the only processing problems Sharethrough has run into with Spark Streaming thus far have been caused not by issues with the software itself but by snafus with the Amazon Kinesis and RabbitMQ technologies that the company uses to feed data into Spark.
Gartner Inc. analyst Nick Heudecker and William McKnight, president of McKnight Consulting Group, also detailed other challenges facing organizations that are looking to mix big data and stream processing technologies. Those include building a robust technical architecture that can handle the data processing workload -- and making sure that a company's analytics and business processes are equipped to deal with and take advantage of the incoming rush of streaming data. "There's no point in accelerating 5% of your business process if the other 95% isn't going to change," Heudecker said.