Apache Spark landed on the map of many data professionals on May 30 when the Apache Software Foundation announced the 1.0 release of the open-source platform. Spark has since continued to grab headlines, but is it ready for enterprise prime time?
The speakers at last week's Spark Summit seemed to think so, though reality may be more complicated. Spark is often described as a runtime environment that sits on top of data stores like Hadoop, NoSQL databases, Amazon Web Services (AWS) and relational databases, and acts as an application programming interface (API) that allows programmers to manipulate data through common applications. Spark comes with a few applications, including an SQL query engine, a library of machine learning algorithms, a graph processing engine and a streaming data processing engine.
There's an opportunity for Spark to become the "lingua franca" for big data, said Eric Baldeschwieler, a technology advisor and the co-founder and former chief technology officer of Hortonworks. Hortonworks is one of several technology vendors, along with Cloudera, IBM, MapR and Pivotal, that have incorporated Spark into their distributions of Hadoop.
Therein lies a major part of Spark's promise. Proponents say it complements Hadoop while also taking the functionality of the much-hyped file system beyond what it can do on its own. Spark advocates say no other platform provides such comprehensive integration of these disparate technologies and functions.
M.C. Srivas, CTO and co-founder of Hadoop distribution vendor MapR, is particularly excited about Spark paired with Hadoop. He says Spark offers an alternative to the clunky and much-maligned MapReduce programming model and, since Spark can process data in-memory, it enables real-time data processing on Hadoop.
"Spark and Hadoop is a winning combination," Srivas said. "APIs are really beautiful. The other thing is in-memory performance. MapReduce forces you to go to disk. There are some things that can be done better in-memory. Real-time is here with Hadoop now. Spark brings it."
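Srivas' contrast between going to disk and staying in memory can be illustrated with a small toy. The sketch below is plain Python, not Spark or Hadoop code, and every name in it (`tokenize`, `count`, `run_with_disk_handoff`, `run_in_memory`) is hypothetical. It assumes a two-stage word count and shows only the difference in how the stages hand data to each other: one version writes the intermediate result to a file and reads it back, the other passes it along directly.

```python
import json
import os
import tempfile


def tokenize(lines):
    """Stage 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count(words):
    """Stage 2: count occurrences of each word."""
    totals = {}
    for word in words:
        totals[word] = totals.get(word, 0) + 1
    return totals


def run_with_disk_handoff(lines, workdir):
    """Disk-based handoff: stage 1 output is written to a file, then read back."""
    path = os.path.join(workdir, "stage1.json")
    with open(path, "w") as handle:
        json.dump(tokenize(lines), handle)
    with open(path) as handle:
        words = json.load(handle)
    return count(words)


def run_in_memory(lines):
    """In-memory handoff: stage 1 output is passed straight to stage 2."""
    return count(tokenize(lines))


lines = ["Spark and Hadoop", "Hadoop and MapReduce"]
with tempfile.TemporaryDirectory() as workdir:
    disk_result = run_with_disk_handoff(lines, workdir)
memory_result = run_in_memory(lines)
```

Both versions produce the same counts. The only difference is the extra write and read between stages, which is the kind of overhead the in-memory approach is meant to avoid as jobs chain more steps together.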
Most of the chatter around Spark has been about its ability to integrate disparate data sources and provide a single, simple interface. But it's beginning to offer more to data scientists who are less interested in the heavy lifting of data management.
Patrick Wendell, a software engineer at Databricks, the vendor that is leading Spark development, said the 1.0 release included 15 pre-defined machine-learning algorithms in its Machine Learning Library (MLlib). That is expected to double with the 1.1 release. Looking further down the road, he said developers are working on an interface for the R programming language, which he said may come in the 1.3 release. Even though Spark has gained fame as a data management tool, Wendell thinks it is all about these analytic libraries going forward.
"The future of Spark is the libraries," Wendell said. "That's what the community has invested in and where the innovation is coming from. We're betting the future of Spark on these libraries."
Does all this mean enterprises should start planning their own Spark implementations? It may be too early for that. The idea of a single API for interacting with and managing streaming and batch data, as well as running both advanced analytics and simpler reporting functions against that data, is appealing. Users today are frustrated with the broad array of tools necessary for managing, analyzing and reporting data. But Spark still has holes.
Srivas said in-memory computing tends to have reliability issues. Spark claims to solve this with its Resilient Distributed Dataset (RDD), a collection of data partitioned across the cluster that is operated on in parallel and can be rebuilt if part of it is lost. Additionally, Baldeschwieler said Spark must expand the number of data stores it can operate on, make it easier for developers to share code and best practices, develop a code portability layer so that programmers can write a job once and execute it against several different data stores, and finally get around to producing an R interface.
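The resilience idea behind the Resilient Distributed Dataset can be sketched in a few lines. What follows is a toy model in plain Python, not Spark's API: the `ToyDataset` class and its methods are hypothetical. It assumes the recovery approach described in Spark's own documentation, in which a dataset remembers the transformations that produced it, so a partition lost from memory can be recomputed from the source data.

```python
class ToyDataset:
    """Toy stand-in for a resilient dataset: partitions plus a recorded lineage."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # stable input data
        self.lineage = tuple(lineage)               # functions applied so far
        self.cache = {}                             # partition index -> computed rows

    def map(self, func):
        """Record a transformation instead of running it immediately."""
        return ToyDataset(self.source_partitions, self.lineage + (func,))

    def compute_partition(self, index):
        """Return one partition, rebuilding it from the source if it is not cached."""
        if index not in self.cache:
            rows = list(self.source_partitions[index])
            for func in self.lineage:
                rows = [func(row) for row in rows]
            self.cache[index] = rows
        return self.cache[index]

    def collect(self):
        """Gather every partition into one list."""
        result = []
        for index in range(len(self.source_partitions)):
            result.extend(self.compute_partition(index))
        return result


numbers = ToyDataset([[1, 2], [3, 4]])
doubled_plus_one = numbers.map(lambda x: x * 2).map(lambda x: x + 1)
before_failure = doubled_plus_one.collect()

# Simulate a failed worker: the in-memory copy of partition 1 is gone.
del doubled_plus_one.cache[1]
after_failure = doubled_plus_one.collect()
```

After the simulated failure, only the missing partition is recomputed, and the collected result matches the one from before. A real cluster spreads partitions across machines and runs them in parallel, which this single-process toy does not attempt.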
"There's lots of opportunities to make it better, but I think Apache Spark is the most exciting thing happening in big data today," Baldeschwieler said.