To many people, Hadoop has become nearly synonymous with big data. It's well-suited for handling the three Vs from the popular definition of big data: volume, velocity and variety. But when it comes to doing more iterative data science work, such as building predictive models or data visualizations, the distributed processing framework often has less of a direct role to play.
The Hadoop Distributed File System (HDFS) is adept at stashing huge amounts of varied data types, and in recent years a variety of open source projects and commercial technologies have been developed in an effort to make it easier to get data out of Hadoop for analytics uses -- SQL-on-Hadoop query engines, for example. Several users of Hewlett-Packard's Vertica analytical database, though, say those tools aren't up to the task of predictive modeling or visualizing data in their organizations, which can limit the potential benefits of using Hadoop for analytics applications.
"Hadoop is a batch-oriented system, and as much as they try to put Pig and Hive on top, it's still not ready for prime time," said Chris Bohn, a senior database engineer at Etsy Inc., during a session at the HP Big Data Conference 2015 in Boston this month. "I think it'll be useful if it gets there, but will it be flexible enough to let you run query after query? Right now it's not there."
Bohn added that he doesn't think data used for predictive modeling should be deposited in Hadoop. It's going to be difficult for analysts to get the data out of HDFS, even if there's a query engine in place, he said. And anything that separates analysts from the data slows the time to insights and diminishes the value of the analysts to the business.
That's why Etsy, a Brooklyn, N.Y., company that operates an online crafts and vintage goods marketplace, went with a Vertica database for all of its modeling and a Hadoop cluster for storing less immediately useful data. "You can't be a data hoarder," Bohn said. "When analysts can get the data themselves, that's a better use of their time."
Hadoop usage falls to data engineers
At DeNA Co., a Japanese Web portal and e-commerce site, data analysts used to run into similar problems on more basic business intelligence and analytics applications. Kenshin Yamada, general manager of the company's analytics infrastructure department, said at the HP conference that all of the company's clickstream data was in a Hadoop cluster. But this made it hard for analysts to produce traffic reports or analyze the popularity of various types of content. A data engineer had to write a query to get analysts the data they needed for each new report out of Hadoop.
In 2013, DeNA added a Vertica database to augment its Hadoop implementation. Yamada said that made the data much more accessible and shortened the time it took for analysts to get the information they needed. The new setup is much more supportive of iterative data science work than using Hadoop for analytics, he said, because queries execute much faster in the Vertica system, allowing the analysts to test a variety of hypotheses in a relatively short period of time.
Data analysts "shouldn't have to go through Hadoop just to create a KPI dashboard," Yamada said, referring to key performance indicators.
Ties between Hadoop, R not binding
For Anmol Walia, a senior applied researcher at customer service contractor 24/7 Customer Inc., Vertica plays a similar role. The Campbell, Calif., company, which operates under the brand name [24]7, ingests clickstream data and customer records from its clients and uses the information to predict which customers will need assistance while browsing e-commerce sites so it can proactively intervene. Everything first gets dumped into Hadoop, but the models that predict customer needs are built in Vertica using data pulled out of Hadoop specifically for that purpose.
One reason for this approach, Walia said, is that Vertica supports the R programming language, which is used by most of the company's data analysts. By contrast, there's no simple integration option between R and Hadoop, according to Walia.
It is possible to integrate the two open source technologies, he said, but they operate on fundamentally different wavelengths: Hadoop is built around a distributed file system and parallel batch processing, whereas R is single-threaded and designed to run jobs on a single CPU. The workarounds required to pull them together typically involve a lot of programming, Walia said.