
Databricks looks to simplify Spark computing via auto-config option

Databricks brings new features to its managed Spark platform -- as well as to open source Spark -- that it hopes will make the computing engine more widely usable.

One thing has been true about Apache Spark since its early days: It is very complicated to use.

Recognizing this barrier to entry, Databricks Inc., the driving force behind Spark's development and one of the vendors offering Spark computing as a service, is looking to automate away some of that complexity for users.

At the Spark Summit 2017 conference in San Francisco, Databricks announced a new product called Serverless, which takes the company's notion of a managed Spark platform a step further.

Prior to Serverless, Databricks' value proposition was essentially that it would manage the installation of Spark on servers and then provide access to these managed servers via the cloud. But once customers acquired Spark resources through Databricks, it was up to them to configure the software on their clusters. This meant allocating resources to specific workloads and defining other configuration settings related to things like data storage and security.

With Serverless, Databricks is doing more of the configuration. Enterprises can tell Databricks how large they want their clusters to be, and the service automatically distributes resources to workloads as they come into the cluster.

Matei Zaharia, CTO at San Francisco-based Databricks and creator of Apache Spark, said the goal is to make Spark less of a tool for data engineers and to open it up to data scientists and general business analysts. Automating more of the configuration of Spark computing clusters and simplifying the process for executing jobs, he said, should let a greater number of employees within a company use the tool.

"This is one of the main reasons we wanted to do this," he said. "As the set of users who want to use it grows, you need to make it much simpler to get started."

Pricing for Serverless is the same as for Databricks' traditional offering: $0.20 per Databricks unit (a unit of processing capability per hour) for data engineering jobs and $0.40 per Databricks unit for analytics workloads.
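To make the per-unit pricing concrete, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the two rates come from the figures quoted above, while the function name, the workload labels and the unit counts are hypothetical, and the sketch ignores any underlying cloud infrastructure charges.

```python
# Rates quoted in the article, in US dollars per Databricks unit.
# The dictionary keys are illustrative labels, not official product names.
RATES_PER_UNIT = {
    "data_engineering": 0.20,
    "analytics": 0.40,
}


def estimate_cost(workload: str, units_consumed: float) -> float:
    """Estimate the charge in dollars for a number of Databricks units.

    Hypothetical helper: multiplies the quoted rate by the units consumed
    and rounds to whole cents.
    """
    if workload not in RATES_PER_UNIT:
        raise ValueError(f"unknown workload type: {workload!r}")
    if units_consumed < 0:
        raise ValueError("units_consumed must be non-negative")
    return round(RATES_PER_UNIT[workload] * units_consumed, 2)


if __name__ == "__main__":
    # Hypothetical usage figures, chosen only to show the calculation.
    for workload, units in [("data_engineering", 100), ("analytics", 100)]:
        print(f"{workload}: {units} units -> ${estimate_cost(workload, units):.2f}")
```

Under these assumptions, the same number of units costs twice as much for analytics workloads as for data engineering jobs.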

Zaharia and his team also announced a new machine learning library addition to open source Spark. Along the same lines as the Serverless announcement, the new library is aimed at lowering the bar to the complicated world of deep learning and artificial intelligence.

The new library adds a pipeline operator for popular machine learning tools TensorFlow and Keras. The two tools have gained traction for machine learning in part due to their simple interfaces. The addition of the new pipeline operator to Spark allows users to develop models in these interfaces using Spark computing as the data processing back end.

Zaharia said this helps address the scaling problem that a lot of users see in machine learning. Many of the tools used to develop machine learning models are desktop or open source software, and it often requires some recoding to take a model from these tools and put it into production. Developing models in tools that tie into an enterprise-scale Spark cluster for data processing minimizes this issue.

"Deep learning is very powerful, but it takes a ton of work," Zaharia said. "We think that with these high-level APIs you can get the same kind of results, just much faster."
