Spark big data framework powers speedy analytics

The Apache Spark engine can be a powerful tool for encouraging big data adoption among front-line workers thanks to its fast processing speeds.

The Spark big data distributed computing framework gets a lot of attention from data engineers, but so far its appeal has largely stopped there. Users say it has one major feature that should help it garner broader appeal: speed.

Businesses are increasingly moving toward self-service analytics applications that tend to be easy to operate. Ease of use is typically seen as one of the biggest factors for organization-wide adoption, but at the Spark Summit 2015 conference, which took place last week in San Francisco, early adopters of the computing framework said that speed may actually be a bigger selling point for getting front-line workers to use data.

"They have to fail fast, they have to iterate," said Gloria Lau, vice president at Timeful, maker of a smart scheduling service that Google recently purchased. "They visualize, they fail again. Iteration is very rewarding. You have to trust that non-engineers are very capable."

While Spark may require intense technical skills to manage its clusters on the back end, the open source technology is relatively user-friendly on the front end. Apache Spark comes with a Spark SQL library that gives users tools to query a variety of data stores using SQL, Java and the R analytics language, and developers can create even more simplified front-end applications that run on Spark using those tools.

In-memory boosts app speed

Since Spark processes data in-memory, any application running in the environment has the benefit of speed. Its creators say it can process data up to 100 times faster than MapReduce, Hadoop's original processing engine, when running jobs in memory and up to 10 times faster when running them on disk.
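The general idea behind that speedup, loading data once and then answering repeated queries from memory, can be illustrated with a minimal sketch. It is plain Python rather than Spark code, and the file layout, class names and thresholds are hypothetical stand-ins chosen for illustration. It shows the pattern only and makes no claim about how Spark is implemented or how fast it runs.

```python
import csv
import os
import tempfile


def write_sample_file(path, n_rows=1000):
    """Write a small CSV of (id, score) rows to stand in for a data store."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "score"])
        for i in range(n_rows):
            writer.writerow([i, i % 10])


class DiskBackedQueries:
    """Re-reads the file for every query, so nothing is held in memory."""

    def __init__(self, path):
        self.path = path
        self.file_reads = 0

    def count_where_score_at_least(self, threshold):
        self.file_reads += 1
        with open(self.path, newline="") as f:
            return sum(1 for row in csv.DictReader(f) if int(row["score"]) >= threshold)


class InMemoryQueries:
    """Reads the file once, then answers every query from the cached values."""

    def __init__(self, path):
        self.file_reads = 1
        with open(path, newline="") as f:
            self.scores = [int(row["score"]) for row in csv.DictReader(f)]

    def count_where_score_at_least(self, threshold):
        return sum(1 for s in self.scores if s >= threshold)


def run_demo():
    """Run the same three queries both ways; return results and file-read counts."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "scores.csv")
        write_sample_file(path)
        disk = DiskBackedQueries(path)
        memory = InMemoryQueries(path)
        thresholds = [3, 5, 8]
        disk_results = [disk.count_where_score_at_least(t) for t in thresholds]
        memory_results = [memory.count_where_score_at_least(t) for t in thresholds]
        return disk_results, memory_results, disk.file_reads, memory.file_reads


if __name__ == "__main__":
    print(run_demo())
```

Both approaches return the same answers, but the disk-backed version touches the file once per query while the in-memory version reads it a single time. That difference grows with every additional iteration a user runs.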

Lau said that for less tech-savvy users, that kind of speed is critical. The typical data consumer isn't interested in spinning up a job that takes 10 minutes to process. They're used to querying services like Google that give them answers almost instantly.

"What you want is to democratize your data," Lau said. "You want everyone to access your data and form their own insights. Speed is the only thing here you should care about."

Brian Kursar, a senior data scientist at Toyota Motor Sales U.S.A., said the speed of Spark helped him and his team develop widely used reports that quantify the public's perception of the Toyota brand on social media. They built a machine learning application based on pre-written algorithms in Spark's machine learning library, known as MLlib. But it took several iterations before they landed on a model that was accurate enough.

The ability to get through this process quickly and deliver something accurate played a big role in getting executives to support the project and use its output, Kursar said.

"When you're working on a product where you're trying to improve the accuracy of models, your ability is going to be limited by something that" does not provide computing power and speed, he said.

NASA uses Spark for data access

Chris Mattmann, chief architect at the NASA Jet Propulsion Laboratory, said he and his team are working on developing a data processing system based on Spark that aims to give researchers access to data stored in disparate file systems.

A lot of scientific data created by NASA and its partners gets locked away in data systems and file types that are specific to the scientific community and can be difficult to access with common tools. Additionally, researchers who can access current data stores have a hard time processing jobs because each query has to pull data out of those stores. Nothing is held in memory.

But the in-memory processing capabilities of Spark will allow the agency to give researchers rapid access to that data regardless of the front-end tool they use.

"We should be able to do all this interactively," Mattmann said. "It should do ETL and put it into memory automatically."

Ed Burns is site editor of SearchBusinessAnalytics. Follow him on Twitter: @EdBurnsTT.
