Satellite images add up to a lot of data -- in DigitalGlobe Inc.'s case, 50 TB of geospatial imagery on a daily basis. The imaging products and services provider augments that with text data from 10 to 15 million geo-tagged Twitter posts every day in an effort to get a better sense of what's happening on the ground worldwide, to aid both in assessing what its satellites are capturing and in pointing them toward new geographic hotspots.
Quickly mining and analyzing all that information has been a challenge, though -- the process is largely manual now, and the territory covered by the images is so vast that analysts "can only do so much putting eyeballs on it," said Tony Frazier, a senior vice president at DigitalGlobe. But the Longmont, Colo., company is looking to speed things up by streaming data for analysis through a combination of big data, stream processing and cloud computing technologies.
Last November, DigitalGlobe started beta-testing a more real-time analytics service that's powered by a Hadoop cluster based on Cloudera's distribution of the open source distributed processing framework. The InsightCloud service, due for commercial release in mid-2015, uses the data streaming capabilities of the Apache Spark processing engine plus tools such as the Storm real-time computing system and HBase database to support algorithm-based analytics. The goal, Frazier said, is to make DigitalGlobe's data resources more useful for driving operational decisions by its customers, which include the likes of military services and other government agencies, global development organizations, oil and gas companies, and infrastructure security businesses.
"What we want to do is go from reporting on events to anticipating them, and ultimately changing outcomes," Frazier said. "The more we can get ahead of those events, the more we can equip the people who are going to take action." For example, he noted that if a military special-forces team is being sent on a mission into an urban setting, "you want to be able to say who can see them there or pinpoint safe places to land" based on current information.
He also cited efforts to police fishing exclusion zones in the world's oceans and seas; because of the vast amount of geographic space that's involved, having data analysts comb through satellite images in search of illegal vessels could take "days or weeks to get an answer." More prosaically, real-time counting of cars in parking lots at shopping malls could provide an indicator of retail traffic, said Frazier, who was general manager of DigitalGlobe's Insight analytics business until the end of 2014 and is now in charge of its offerings for U.S. government agencies.
DigitalGlobe previously tried to accelerate its analytics throughput by adding video processing hardware. But that was an expensive approach, according to Frazier, who said tapping the Spark software's stream processing module on top of Hadoop should be much more cost-effective. Initially, the company is running all the tweets it collects through the Spark Streaming technology as it works with beta users of InsightCloud, which can be deployed on-premises, in the Amazon Web Services cloud or in setups hosted at DigitalGlobe's data centers. But Frazier said the company plans to add different types of satellite images during the course of 2015 and have all of its data streaming through Spark by year's end.
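To make the idea of streaming geo-tagged posts in small batches more concrete, here is a minimal sketch in plain Python. It is not DigitalGlobe's pipeline and does not use the Spark API; the `Post` record, the one-degree grid and the 60-second batch length are all assumptions chosen for illustration. It only shows the general pattern of grouping timestamped records into fixed intervals and counting them per location.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Post(NamedTuple):
    """Hypothetical geo-tagged post: epoch seconds plus latitude/longitude."""
    ts: float
    lat: float
    lon: float


def grid_cell(lat: float, lon: float, size_deg: float = 1.0) -> Tuple[int, int]:
    """Map a coordinate to a coarse grid cell (an assumed bucketing scheme)."""
    return (int(lat // size_deg), int(lon // size_deg))


def micro_batch_counts(posts: Iterable[Post],
                       batch_secs: float = 60.0) -> Dict[int, Counter]:
    """Group posts into fixed-length batches and count posts per grid cell."""
    batches: Dict[int, Counter] = defaultdict(Counter)
    for post in posts:
        batch_id = int(post.ts // batch_secs)  # which interval the post falls in
        batches[batch_id][grid_cell(post.lat, post.lon)] += 1
    return dict(batches)


def hotspots(counts: Counter, threshold: int) -> List[Tuple[int, int]]:
    """Return grid cells whose post count in one batch meets the threshold."""
    return sorted(cell for cell, n in counts.items() if n >= threshold)
```

Calling `micro_batch_counts` on a list of posts returns one counter per interval, and `hotspots` picks out the cells that cross a chosen threshold in that interval. A production system would read from a live feed and distribute the work across a cluster; this sketch processes an in-memory list on one machine.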
Data stream not clogged with swimmers
Stream processing is still a niche application, even among big data users. For example, in a survey conducted last June by consultancy Gartner Inc., only 22% of the 218 respondents with active or planned big data initiatives said they were using stream or complex event processing technologies or had plans to do so (see chart). That percentage was unchanged from a similar survey the year before.
But Gartner analyst Nick Heudecker said he sees "a pretty bright horizon" for streaming data platforms, partly because of the development of new technologies like Spark and Storm that can take advantage of distributed Hadoop clusters built on relatively low-cost commodity servers.
William McKnight, president of McKnight Consulting Group, said the increasing adoption of data science techniques and advanced analytics tools will also create "a more voracious need for timely data" in organizations. But he thinks that process likely will play out slowly, resulting in broader adoption of streaming data processing software over the long haul. "These products are certainly not something that everybody has to have," McKnight said. "You've got to have a real pressing need for real-time [analytics] in order for it to make sense."
Uses that are a good fit include various finance applications, such as stock trading, fraud detection and regulatory compliance monitoring, Heudecker and McKnight said. Another potential use is in real-time customer engagement programs -- targeting ads or promotional offers to individual customers while they're browsing a retailer's website or talking to a customer service representative on the phone.
Heudecker also pointed to applications involving streams of data from sensors on industrial equipment, pipelines and other machinery connected to the so-called Internet of Things. He said the IoT could be "a substantial factor" in driving the growth of stream processing as organizations look to take advantage of machine data to do predictive maintenance and spot possible problems before equipment fails.
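As a rough illustration of what spotting possible problems in a stream of machine data can look like, the sketch below flags sensor readings that stray far from their recent rolling average. It is a simplified example in plain Python, not a method described by Heudecker or any vendor; the function name `flag_anomalies`, the window size and the three-standard-deviation limit are assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, Iterator, Tuple


def flag_anomalies(readings: Iterable[float],
                   window: int = 5,
                   z_limit: float = 3.0) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings far from the recent rolling average."""
    recent = deque(maxlen=window)  # holds only the last `window` readings
    for i, value in enumerate(readings):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip the check when recent readings are identical (sigma == 0).
            if sigma > 0 and abs(value - mu) > z_limit * sigma:
                yield (i, value)
        recent.append(value)
```

Because the function is a generator that keeps only a small window in memory, it can consume readings one at a time as they arrive. Real predictive maintenance systems typically rely on richer models and many sensors at once; a rolling threshold like this is only the simplest starting point.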
Firehose of options for streaming data
And there are plenty of data streaming technologies to choose from once a company decides to take the plunge. In addition to the open source tools connected to Hadoop, vendors like IBM, Informatica, SAP, Tibco Software and Vitria Technology sell more traditional complex event processing platforms that have evolved to support big data applications. In the cloud, AWS, Google, Microsoft and others offer stream processing services. Some streaming analytics specialists are also pushing products. Heudecker counts "literally dozens" of vendors in the market. "There are so many options," he said, "that it can be hard to balance what you want to do with the technology that's available."
But the potential benefits are as big as the data involved. At DigitalGlobe, other real-time analytics uses that Frazier said could be enabled by stream processing of image data include identifying damaged buildings after natural disasters and pinpointing the locations of dwellings in remote areas of undeveloped countries to aid in planning vaccination programs.
"It's a big world, and there are a lot of bad things happening," he said. "We want to be able to figure out when something is emerging, or trending in the wrong direction. And with these unconventional problems, you need unconventional tools to address them."