Databricks pushes its Spark big data platform to the masses

After nearly a year of limited availability, software vendor Databricks has made its cloud-based version of the Spark processing engine generally available.

Databricks Cloud, a cloud-based commercial version of the Spark big data processing engine, is now generally available, and it packs a few updates aimed at satisfying demands for functionality from the data scientist community.

When vendor Databricks Inc. started a limited availability release last June, the company hand-selected customers based on whether it thought they had use cases compatible with its cloud infrastructure. Along the way, it built up a long waiting list of organizations that wanted the product. As of today, Databricks will sell the Databricks Cloud service to all comers.

The generally available version of Databricks Cloud is based on Apache Spark 1.4, an update of the open source technology released last week. The most prominent addition included in the 1.4 release is an interface for the R language, something Spark users have been calling for and that Spark developers have been promising since last summer. R is one of the most popular analytical programming languages among data scientists, and integration with Spark will allow it to be used to build and run applications on a diverse array of large data stores.

Better collaboration, more control

Spark 1.4 also includes built-in integration with the version control site GitHub, which lets multiple developers track changes to projects, whether they involve analytics algorithms or application development, and so collaborate more easily. In addition, the new release gives IT administrators the ability to assign end users to role-based groups for improved access control. Databricks said it will add support for those features to its product offering during the second half of the year.

Databricks was co-founded by Spark creator Matei Zaharia, and the Berkeley, Calif., company is one of the chief contributors to the Spark open source project within The Apache Software Foundation. Initially, Databricks is running its version of the platform on the Amazon Web Services cloud, and one of the main draws of Databricks Cloud is that it gives users access to Spark's feature set -- including its ability to process data in-memory -- without requiring them to manage the installation themselves.

Benny Blum is among those users. Blum is vice president of product and data science at Databricks customer Sellpoints Inc., an e-commerce optimization services provider in Emeryville, Calif., that helps companies drive more traffic to their websites and better target their online advertising to potential customers. He said he likes the features of Spark but doesn't want to have to internalize management of the technology, which can become relatively complex and require a significant time investment.

"We could stand up our own clusters and run Spark," Blum said. "But Spark is pretty raw and it requires a lot of resources to make sure [the clusters are] doing what they're supposed to be doing."

Back to the old ways with Spark and R

The integration with R is another attractive feature for Blum. He said Sellpoints did most of its data analysis in R prior to bringing in Databricks Cloud, and a lot of the company's data scientists liked the language. But since R wasn't supported in previous versions of the Spark big data engine, it was taken off the table for them when Sellpoints implemented the Databricks technology at the beginning of this year. They're now looking for specific projects in which R could be re-implemented.

The R support could also address one of the main flaws Blum has seen in Spark. Since the platform is primarily designed to process large volumes of data, he said its library of machine learning algorithms can be difficult to implement for smaller jobs that require flexibility, such as applications that are still being developed and may need to be tested and updated frequently before they're put into production use. R was originally designed to process jobs in-memory on a single machine, so it is more geared toward supporting that kind of development flexibility. The new interface could help bridge the divide for data scientists and other end users, according to Blum.

"One reality is that Spark is designed to scale, so the machine learning libraries in Spark are limited to those [applications] that can scale," he said.

Ed Burns is site editor of SearchBusinessAnalytics. Follow him on Twitter: @EdBurnsTT.
