The thing about big data is, no matter how big it gets, users always want more.
That's one of the lessons CardinalCommerce Corp. learned on its journey toward enhanced big data analytics capabilities. The Mentor, Ohio-based company, which was purchased by Visa in 2017, offers a service to online merchants to confirm the identity of purchasers using alternative payment platforms, like PayPal. Needless to say, online financial transactions generate huge amounts of data, and generating more insights from that resource has been both a top priority and a central challenge.
"We run a lot of transactions daily," said Christopher Baird, data systems manager at CardinalCommerce. "We have metric data we collect, log data, all kinds of things. We have to bring it back to an environment for reporting."
A few years ago, as Apache Spark was gaining prominence, Baird's team members built a small cluster on premises to run Spark for basic data processing tasks, like bringing data back from CardinalCommerce's web payment processing platform for reporting. They primarily used Microsoft SQL Server Reporting Services software to track data quality issues in the XML message format that the platform uses to authenticate online shoppers. They saw good results, but realized the cluster was too small to scale to larger use cases.
"It quickly became apparent that it was just not sufficient for all the data we had available," Baird said.
Spark in the cloud improves scalability
So CardinalCommerce decided to move the Spark workloads to Amazon's Elastic MapReduce (EMR) big data service in the cloud. Doing so gave the team more flexibility to scale Spark to larger workloads as needed. But it also created new problems for the company's big data analytics efforts.
Pricing was complicated and, as more team members got involved in Spark jobs, getting everyone on the same page was a challenge, Baird said. Because Spark clusters were spun up for each new job and taken down when the job was completed, team members needed access to constantly changing Apache Zeppelin notebooks, which the team used in EMR as a front end for analyzing data in Spark.
During this time, Baird and his team wanted to make data in Spark available throughout the company. In particular, that meant opening up the data to the merchant support team so it could report to clients on transactions processed through CardinalCommerce's platform. "We had this mission of getting people who weren't on our team to use the product," Baird said.
The effort led his team to Databricks' Spark platform. They made the move, Baird said, partly because Databricks offers a simplified user interface compared to EMR -- and a pricing structure that makes it easier to spin up Spark clusters as needed and know the cost ahead of time. Now, anyone with basic SQL skills can query data in Spark, he added.
The move wasn't without tradeoffs, however. Databricks is more expensive than EMR, according to Baird. But reducing complexity in the company's big data analytics capabilities made the added costs worthwhile, he said.
Big data success sparks more demand
As often happens in big data environments, though, success breeds demand for more use cases. Once the platform was up and running and delivering consistent results for structured reports and ad hoc SQL queries, Baird's team decided to run Tableau's data visualization software on Databricks. Baird said the results were inconsistent at first.
The problem was that people had access to vast troves of data and wanted to analyze it all. However, Tableau can choke on such large data volumes. Visualizations were slow to render and performance was poor.
The team has since worked with users on being more selective about the data they bring into visualizations, but to some degree, the situation reflects the nature of self-service analytics: Once people get a taste of data, they generally want more. "People want to make visualizations with a huge timeframe covering lots of data," Baird said. "That's not going to be a very efficient query."
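To make the point about selectivity concrete, here is a minimal sketch of the general idea: narrow the time window and aggregate before handing data to a visualization tool. It is not CardinalCommerce's actual setup. It uses Python's built-in sqlite3 module as a stand-in for a Spark SQL endpoint, and the `transactions` table, its columns, the `daily_summary` helper and all the values are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

# Stand-in database; the table name, columns and data are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (txn_date TEXT, amount REAL)")

# Generate 20 fake transactions per day for 365 days starting Jan. 1, 2024.
start = date(2024, 1, 1)
rows = [
    ((start + timedelta(days=d)).isoformat(), 10.0 + d % 5)
    for d in range(365)
    for _ in range(20)
]
conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)


def daily_summary(conn, first_day, last_day):
    """Return one aggregated row per day in [first_day, last_day]."""
    query = """
        SELECT txn_date, COUNT(*) AS txn_count, SUM(amount) AS total_amount
        FROM transactions
        WHERE txn_date BETWEEN ? AND ?
        GROUP BY txn_date
        ORDER BY txn_date
    """
    return conn.execute(query, (first_day, last_day)).fetchall()


# Compare the size of the raw table with a one-week, pre-aggregated result.
all_rows = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
summary = daily_summary(conn, "2024-03-01", "2024-03-07")

print(f"raw rows available: {all_rows}")
print(f"rows sent to the visualization: {len(summary)}")
```

The same pattern, a date filter plus a GROUP BY, applies to most SQL engines. The visualization layer then renders a handful of summary rows rather than every underlying transaction.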