Deploying a Spark cloud architecture may seem like a no-brainer for IT and analytics teams looking to implement the open source data processing engine. After all, most commercial Spark distributions come ready to run remotely in the cloud, giving users the advantage of being able to scale up their systems when demand for computing power spikes. But not all businesses have chosen that route for running big data analytics applications and other jobs through Spark.
For example, speaking at Spark Summit 2016 in San Francisco earlier this month, Joseph de Castelnau, senior vice president of software engineering at The Nielsen Co.'s marketing return on investment (ROI) applications unit, described his organization's path to on-premises Spark.
For the Nielsen unit, which is based in Evanston, Ill., the decision was born out of necessity more than anything else. The operation develops ROI calculators for marketing teams and got involved with Spark to move more heavily into online ad sales attribution, which analyzes the effectiveness of different elements in marketing campaigns. But de Castelnau said many large retailers that work with Nielsen didn't want their marketing data stored remotely. So he and his team decided to stand up their own Spark cluster internally.
Initially, de Castelnau said, he didn't think it would be too difficult -- he thought of it as primarily a data engineering problem that could be solved simply by hiring more engineers. But problems quickly arose. Scaling up analytics processes developed in software from SAS Institute Inc. to run on larger distributed data sets proved challenging. Data quality was also an issue.
"We were pretty naïve," de Castelnau said. "We were used to dealing with a lot of off-line data. We figured digital data would be easier. It's not. It's much harder. It's much dirtier."
The case for Spark in the cloud
Nielsen is still working through those problems, and it has had to develop a lot of manual processes as a result. But the marketing ROI unit is now investigating Spark cloud options as well. De Castelnau said he and his team are looking into a multi-cloud arrangement that would enable individual retailers to store their data with whatever cloud service provider they like, while allowing the data from different sources to be processed for analysis by one Spark engine.
"We want to get on the cloud," de Castelnau said. "It will be easier to do end-to-end analytics."
This ability to run in multiple environments is one of the main advantages of Spark, said Nik Rouda, an analyst at Enterprise Strategy Group who also spoke at the event. Running in multi-cloud, single-cloud or on-premises architectures makes it more flexible than traditional data processing frameworks and helps kill what Rouda called the "tyranny of the hardware structure." That has decoupled computing from storage and made the particular storage engine less important, he said.
Rouda pointed to research ESG has done showing that while only 20% of enterprises today choose the cloud as their primary data platform, another 40% say they're very interested in it. The ability of platforms like Spark to fit into cloud environments is part of the reason for the growing interest, he added.
Reasons for on-premises Spark remain
Brad Peterson, executive vice president and CIO at stock-market operator Nasdaq Inc., said the financial services industry as a whole has been very slow to adopt cloud technologies due to the sensitive nature of the data it handles. New York-based Nasdaq addressed that reticence by moving cautiously into the cloud and initially rolling out Spark cloud systems for revenue cycle management and billing applications, which use less-sensitive data than its trading and transaction-clearing systems do.
The approach proved that cloud systems can outperform on-premises platforms in some cases, according to Peterson. "We looked at it and said, 'How, in financial services, can we get in there and get experience?'"
The experience gained thus far has been positive for Nasdaq, but Peterson acknowledged that there's still resistance to the cloud for data processing and analytics uses in the broader financial services market.
Economics were once a primary argument in favor of the cloud, but that's slowly changing. Rouda said early adopters are finding that the costs of on-demand computing and storage, while initially low, can add up over time, making on-premises platforms competitive on total cost of ownership.
At the same time, though, data security -- once the biggest argument used by IT managers against cloud deployments -- is turning into a strength for cloud platforms. "When you look at the advantages, there's been a little bit of a reversal," Rouda said.