michelangelus - Fotolia

For users, Hadoop SQL tools go hand in hand with big data framework

SQL-on-Hadoop query engines are becoming a common companion of Hadoop in big data systems as IT managers look to ease programming for analytics applications and data integration jobs.

For now, SQL-on-Hadoop adoption remains on the low side -- but it appears to be progressing at a similar pace to adoption of Hadoop itself.

That's partly evident from technology conferences, where speakers routinely have both Hadoop and SQL-on-Hadoop software in their big data environments. And it's borne out by the results of a 2015 survey on emerging technologies conducted by IT research and education outfit TDWI.

Just 16% of 320 survey respondents said their organizations were using a commercial SQL-on-Hadoop engine as part of business intelligence or analytics applications; on a separate question, 22% of 344 respondents said Hadoop SQL technology was being used for managing data. In the latter case, though, the deployment level was close to Hadoop's, both at the time of the survey and as planned within three years -- indicating that most Hadoop users also implement SQL-on-Hadoop tools (see chart).

Responses to TWDI survey on Hadoop usage

One of those users is Progressive Casualty Insurance Co. It began running a Hadoop cluster in late 2013 to process and analyze telematics data collected from customers' cars as part of a usage-based auto insurance program called Snapshot; the cluster, based on Hortonworks Inc.'s Hadoop distribution, also holds research and development data used to price insurance products, plus activity logs from Progressive's website.

Most of the analytical queries and extract, transform and load (ETL) integration processes run against the data in the cluster are created with Hive, an open source SQL-on-Hadoop tool. Progressive's business analysts and data scientists were already "heavy SQL users, so we wanted to keep the structure of the data the same for them," said Pawan Divakarla, the company's data and analytics business leader. "We have an entire business community that uses the data, and there was no point making them learn something else."

Some helping hands for Hive

IT architect Chris Barendt said Progressive encountered some bugs in Hive early on after deploying it, but the technology now is more stable and "does most of what we need." But to help increase Hive's performance and enable it to support interactive querying, the Mayfield Village, Ohio, insurer has augmented the Hadoop SQL engine with Tez, an open source application framework that originated at Hortonworks and was designed for optimizing data processing throughput in Hadoop systems.

Also, most of the data analysts at Progressive work with Hive through Tableau's BI software or Hue, a user interface for web applications that was originally developed by Hadoop vendor Cloudera Inc. Only some power users do direct command-line SQL coding in Hive, Barendt said.

Online dating service Zoosk Inc. uses a combination of Hive for ETL and Cloudera's Impala SQL query software for analytics, also with Hue and Tableau on the front end to ease coding for its data analysts. The San Francisco company deployed a Cloudera-based Hadoop cluster in 2012, initially to process the large volumes of user activity and system log data generated by its website and then pass aggregated views of the information to an enterprise data warehouse built on Microsoft's SQL Server database.

Martin Lam, who was senior director of analytics and data science at Zoosk until earlier this year, said while still at the company that its developers first tried to do programming in MapReduce, Hadoop's original processing environment. But that took much more time than using SQL did -- a couple hours to write and test a job versus a couple minutes. Zoosk then implemented Hive to support ETL processing in the cluster. However, the Hadoop SQL tool was too slow to support ad hoc analysis of the data, according to Lam. As a result, "we didn't use Hadoop as an analytical platform," he said.

A match made in Hadoop SQL heaven?

That changed after Cloudera released a beta version of Impala in late 2012. Zoosk signed on as an early user, and Lam said the query engine's performance has made analyzing the raw Hadoop data more feasible.

For example, he said that in combination with Parquet, a columnar storage format for Hadoop that Zoosk added to its big data architecture in mid-2015, Impala can run a typical query on website user interactions in eight seconds; by comparison, the same query takes just under eight minutes with Hive alone and nearly six minutes with Hive and Parquet together. The faster speeds have also enabled the addition of more-advanced analytics applications -- in particular, a behavioral matchmaking one that aims to predict possible matches between Zoosk users based on their use of the site.

You definitely see that you're on the bleeding edge of a language as it's being developed.
Benny Blumvice president of product and data, Sellpoints Inc.

Zoosk is sticking with Hive on the ETL side, though, for processing the hundreds of millions of rows of data it captures from the website daily, an incoming flow that has amounted to about 200 TB of info in the cluster altogether. "Impala provides a lot of speed, but it can be unpredictable if you aren't careful," Lam said. "It's easier to guarantee that Hive will finish a job."

Impala also still lacks some standard SQL features, including support for extensible markup language and JavaScript Object Notation functions and for nonscalar data types such as maps and arrays. "But the gap is getting narrower and narrower," said Lam, who now works at Facebook. "For most [analytics] use cases, you can get by."

The same applies to the Apache Spark processing engine's Spark SQL module, said Benny Blum, vice president of product and data at Sellpoints Inc., an online marketing and advertising company in Emeryville, Calif., that uses Spark SQL for ETL processing.

"There are things that I can't do in Spark SQL now," Blum said. "You definitely see that you're on the bleeding edge of a language as it's being developed." On the other hand, he noted that the query speeds Sellpoints is seeing with Spark SQL "are astronomically higher than they were six to eight months ago," due to improvements in the SQL technology and more effective use of Spark overall by the company. 

Next Steps

Three basic things to know about SQL-on-Hadoop technologies

Start with the use case in evaluating your Hadoop SQL software options

Big data users and vendors look to raise their Hadoop management game

Dig Deeper on Big data analytics