Joe Matarese first became interested in the Apache Spark data processing framework when it was still a project being developed in the University of California, Berkeley's AMPLab. What he liked most was Spark's scalability, which looked like it would enable the technology to process ever-increasing volumes of data for analysis and staging.
His opportunity to implement Spark came in June 2014 in his role as chief technology officer at BlackArrow Inc., a company that helps cable and pay-TV operators like Comcast and Time Warner Cable dynamically insert ads targeted at different viewers into programs, whether they're broadcast across traditional TV services or newer online platforms. BlackArrow then offers analyses of the ads to see what works and what doesn't. Based on his experience, Matarese said Spark still has some growing up to do, but he thinks its long list of functions makes it ready for prime time now.
"Spark has good adoption and a development community, which is why we love it, but it's still a relatively young technology," Matarese said. "But we've accepted a lot of those issues. The attraction is that [there are] a lot of capabilities, and we expect to make use of those capabilities."
Spark pulls together data for analysis
BlackArrow currently is using Spark data processing to aggregate local data from cable operators; that includes details on the shows being watched and the viewing platforms being used, plus basic information about viewers. BlackArrow uses the data to determine when commercials should be shown and what types of ads are likely to be most effective based on viewer characteristics. After the fact, Spark feeds data about which ads customers viewed and which they skipped into an Infobright analytical database, which cable operator customers can access to view prebuilt reports or do their own ad hoc analyses on the effectiveness of ads using Pentaho's business intelligence and data visualization tools.
Matarese said he chose Pentaho in July over other data visualization vendors because it offered greater control over how reports are surfaced to BlackArrow's clients. The company previously used a homegrown reporting system, but customers started wanting to do more ad hoc querying, which prompted BlackArrow to decide it needed a packaged tool that would support static reports as well as user queries.
Matarese decided to work with Spark for its data processing speed and power, as well as its SQL querying and machine learning functions. He declined to specify exactly how BlackArrow is using those two Spark data analytics features, saying that most of what the company does with them is proprietary. But he said the ability to query data and run machine learning algorithms on it plays an important role in optimizing ad selection and placement.
Making Spark work, warts and all
That's not to say Spark is free of warts. For example, Matarese said his team has experienced some stability issues since implementing the technology. From the start, BlackArrow went with the Spark distribution from Cloudera, minus the Hadoop component. At one point, they tried to bring the latest open source Apache Spark version into BlackArrow's development environment, but that caused some conflict with processes they had built in an earlier version of Spark. Matarese said solving those kinds of issues takes persistence and an understanding that a Spark implementation is going to require ongoing development and maintenance. "You're going to have to deal with a few hurdles," he warned.
As for whether the Spark data processing framework is a replacement for Hadoop, a perennial question among big data thinkers, Matarese said he sees the two systems supporting very different use cases. For BlackArrow, getting data out to customers was a top priority, so the speed of processing and ease of querying in Spark make sense. But he sees situations where Hadoop would still be a useful technology, particularly when data volumes are huge and time is less of an issue.
Ultimately, it comes down to choosing the right tool for the job. "It all goes back to understanding what you're trying to do and the problems you're trying to solve," Matarese said.