At last year's Spark Summit conference, Patrick Wendell, a software engineer at Databricks Inc. and a contributor to the Apache Spark open source project, said the technology's data processing capabilities are impressive but its real power lies in the Spark library components that sit on top of the core engine. "The future of Spark is the libraries," he said. "That's what the community has invested in and where the innovation is coming from."
The Spark platform comes with four distinct libraries -- Spark SQL, Spark Streaming, a graph processing library called GraphX and a machine learning one known as MLlib -- that include pre-built algorithms and programming capabilities designed to streamline data preparation, exploration and analysis tasks. The libraries enable users to automate certain tasks and eliminate some of the coding that typically would be required.
For example, using Spark SQL to run SQL queries against data "solves some of the pain points of our existing business," said James Peng, a principal architect at the Chinese search engine vendor Baidu. "This can drastically improve the decision-making process."
Baidu has been running Spark for about a year, and Spark SQL has become one of the technology's most popular features within the company. Peng's team used it to open up Baidu's Hive tables, which pull data on the site's Web traffic from Hadoop clusters, to product managers, enabling them to track popular search terms to help guide sales and marketing initiatives.
Spark SQL speeds analytics answers
Prior to the Spark deployment, the product managers had to send queries they wanted to run to Hadoop engineers and then wait for the answers to come back, which typically took about 10 minutes. But now, the ones who know SQL can query the data in Hive themselves in just 30 seconds or so, Peng said.
The main benefit of Spark SQL is that it provides a unified way to query a variety of structured data types, including Parquet and JSON files in addition to Hive tables. The biggest downside, Peng said, is that his team has had trouble scaling Spark SQL queries to very large batch jobs as easily as it could in MapReduce, Hadoop's original software programming framework. He's hopeful that will be addressed in future versions of the library.
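That "one SQL dialect over many sources" pattern is the core of Spark SQL's appeal. Spark itself reads these sources through its own API (for example, loading JSON and Hive tables into DataFrames and querying them together), which requires a running cluster; the self-contained sketch below uses Python's stdlib `sqlite3` and made-up search-traffic rows purely to illustrate the idea of querying heterogeneous sources with a single SQL statement.

```python
import json
import sqlite3

# Hypothetical sample records standing in for two different sources:
# a JSON log feed and a Hive-style table of search traffic.
json_rows = [json.loads(s) for s in (
    '{"term": "spark", "hits": 120}',
    '{"term": "hadoop", "hits": 80}',
)]
hive_rows = [("spark", 45), ("flink", 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE json_traffic (term TEXT, hits INTEGER)")
conn.execute("CREATE TABLE hive_traffic (term TEXT, hits INTEGER)")
conn.executemany("INSERT INTO json_traffic VALUES (?, ?)",
                 [(r["term"], r["hits"]) for r in json_rows])
conn.executemany("INSERT INTO hive_traffic VALUES (?, ?)", hive_rows)

# One SQL query spanning both sources -- the shape of what Spark SQL
# offers across Hive, Parquet and JSON at cluster scale.
totals = dict(conn.execute(
    "SELECT term, SUM(hits) FROM "
    "(SELECT * FROM json_traffic UNION ALL SELECT * FROM hive_traffic) "
    "GROUP BY term"))
print(totals)  # {'flink': 10, 'hadoop': 80, 'spark': 165}
```

The payoff is exactly what Peng describes: an analyst who knows SQL never has to care which underlying format a source arrived in.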
In an interview, Ali Ghodsi, vice president of engineering and product management at Databricks, said every one of the Spark vendor's customers is using Spark SQL, making it the most popular library. Right behind it, he added, is MLlib. The machine learning library includes a set of algorithms that can be used to run various types of analytical models, such as regressions, cluster analyses and decision trees. That can lower the bar to doing some advanced forms of predictive analytics and data mining.
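Regression, the simplest of the model types Ghodsi mentions, reduces to a small amount of arithmetic on a toy dataset. The sketch below fits an ordinary least-squares line in plain Python with invented numbers; MLlib runs the same kind of fit, but distributed across a cluster and over far larger data.

```python
# Ordinary least squares on a toy dataset -- the kind of regression
# MLlib fits at cluster scale; plain Python here just to show the math.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.0, 8.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x); intercept anchors the line
# so it passes through the mean point.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x
print(round(slope, 2), round(intercept, 2))  # roughly 2.01 and 0.0
```

The "lower bar" Ghodsi refers to is that MLlib users call a pre-built fit routine rather than writing even this much math by hand.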
Christopher Burdorf, a software engineer at media and entertainment company NBCUniversal, said the analytics team there uses MLlib to help decide which media files to pull from the distribution servers that support NBC's international cable TV operations. The servers store hundreds of terabytes of files for broadcasting by local cable operators in far-flung parts of Europe and Asia, but Burdorf said that maintaining a single master list of what's needed at all times is impossible. If NBC left all its files on the servers, they would quickly fill to capacity -- but if it took down too many, it would risk angering cable operators looking to air the removed shows.
The analytics team used to run simple internally developed algorithms that read every file on the servers and recommended whether to remove files based on predefined features such as a file's age and the number of days since its last airing. But reading all the files was incredibly resource-intensive, according to Burdorf.
Machine learning automates analytics
To get around the problem, he and his colleagues implemented a Spark analytics application that enabled them to "train" an MLlib algorithm on what a file that should be removed from the servers looks like. That way, they can predict which files should be removed based on general characteristics. "We got a real performance speed-up because it's not having to constantly check [all] those files," Burdorf said.
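The simplest classifier in the decision-tree family MLlib provides is a one-level stump: search every feature and threshold for the split that best separates "keep" from "remove" in the labeled history, then apply that rule to new files. The sketch below does that in plain Python with invented feature values (file age and days since last airing, echoing the features described above); NBCUniversal's actual model, features and thresholds are not public, and MLlib performs this split search distributed rather than in a loop.

```python
# A one-level decision stump trained on two hypothetical features:
# (file age in days, days since last airing). Label 1 = file was removed.
training = [
    ((30, 5), 0), ((45, 10), 0), ((400, 200), 1),
    ((365, 90), 1), ((20, 2), 0), ((500, 300), 1),
]

def best_stump(data):
    """Exhaustively pick the (feature, threshold) split with fewest errors."""
    best = None
    for feat in (0, 1):
        for threshold in sorted({x[feat] for x, _ in data}):
            errors = sum(int((x[feat] > threshold) != bool(y))
                         for x, y in data)
            if best is None or errors < best[2]:
                best = (feat, threshold, errors)
    return best

feat, threshold, _ = best_stump(training)

def should_remove(features):
    # The learned rule: remove when the chosen feature exceeds the threshold.
    return features[feat] > threshold

print(should_remove((450, 250)), should_remove((15, 3)))  # True False
```

This is also why the approach is fast in production: once trained, the rule evaluates a few stored attributes per file instead of rereading the file itself.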
He added, though, that the programmers had to test several different models before they found one that worked accurately. Additionally, once the model was put into production, they found that it was filling up servers with temporary files, which crashed the system one weekend. They had to write a separate program to clean up those files.
The Spark Streaming module is geared toward simplifying the process of capturing, analyzing and visualizing streaming data. It runs small, frequent batch jobs that collectively can process data in near-real-time, allowing users to build applications that harness streams of information.
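The micro-batch idea can be shown in miniature without a cluster: chop an incoming stream into small fixed-size batches and run an ordinary batch computation on each one. The plain-Python sketch below, with made-up page-hit events, only illustrates that pattern; Spark Streaming does the equivalent continuously, batching by time interval across distributed receivers.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield successive fixed-size batches from an event stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical page-hit events standing in for a live traffic feed.
page_hits = ["home", "search", "home", "listing", "home", "search"]

# Run a small batch job (a per-page count) over each micro-batch,
# as a streaming dashboard would before rendering each refresh.
counts_per_batch = []
for batch in micro_batches(page_hits, 2):
    counts_per_batch.append({page: batch.count(page) for page in set(batch)})
print(counts_per_batch)
```

Because each batch is small and frequent, the aggregated results arrive in near-real-time, which is what makes dashboards like AutoTrader's possible.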
Online automotive marketplace AutoTrader Inc. uses Spark Streaming to visualize its site traffic in order to monitor the impact of ads and check the overall health of the site after updates. For example, for this year's Super Bowl, AutoTrader's analytics team built a real-time dashboard that tracked bumps in traffic on the site after car brands aired ads. That enabled them to quantify the lift that specific advertisements delivered at a higher level of granularity than they could in previous years, when they would visualize the data in one-hour chunks.
Jon Gregg, a senior analytics engineer at AutoTrader, said the Super Bowl dashboard mainly paid off for the site in public relations value. But the company's data analysts use a similar Spark-based application to monitor Web traffic on an ongoing basis, which can help them detect potential anomalies following a site update -- if traffic crashes, chances are something is wrong. They're also considering developing dashboards that would be fed by Spark analytics data, for executives to use in monitoring site traffic.
Graph processing pulls parts together
GraphX is Spark's graph computation library; it comes with a set of algorithms that allow users to structure, search and display data in graph form, organized according to the relationships between different objects. Software developer Autodesk Inc., which makes 3D design tools for manufacturing, architecture and construction applications, uses GraphX to visually map the relationships between various parts used in creating designs. The graphing functionality allows users of the company's software to search its design catalog for specific parts while putting together a design and see other parts that typically go together.
The GraphX implementation works by mining the files for each part Autodesk has information on. It can classify parts according to their function -- as gears or springs, for example. From there, the Spark system groups parts into clusters that are commonly used together. "Because this is a graph, we can start exploring things," said Yotto Koga, a software architect at Autodesk.
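The grouping step Koga describes amounts to finding connected clusters in a graph whose edges link parts that appear together in designs. GraphX itself is a JVM library that runs such algorithms at scale; the standalone Python sketch below, with invented part names and edges, just shows the underlying idea of walking a co-occurrence graph to collect a cluster of related parts.

```python
from collections import defaultdict

# Edges link parts that appear together in designs
# (part names are made up for illustration).
edges = [("gear_a", "shaft"), ("shaft", "bearing"),
         ("spring", "damper"), ("bolt_m6", "nut_m6")]

adjacency = defaultdict(set)
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

def component_of(part):
    """Breadth-first walk collecting every part reachable from `part`."""
    seen, frontier = {part}, [part]
    while frontier:
        frontier = [n for p in frontier
                    for n in adjacency[p] if n not in seen]
        seen.update(frontier)
    return seen

# Searching for one part surfaces the cluster it is typically used with.
print(sorted(component_of("gear_a")))  # ['bearing', 'gear_a', 'shaft']
```

A catalog search for one part can then surface the rest of its cluster as "parts that typically go together," which is the user-facing feature Autodesk built on top of GraphX.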
As with most things in the Spark framework, though, there was a learning curve. Koga said Autodesk had an earlier homegrown system that did some of the things GraphX does, and porting the earlier classifications into Spark posed some problems.
But vendors like Databricks and the rest of the Spark community continue to develop the technology, and users are hopeful that the lingering bugs detailed at the conference will eventually be worked out. "Spark has already shown great potential," Baidu's Peng said. "But our ambition with Spark ends not just with SQL. We want it to be used for all general computation."