Using big data platforms for data management, access and analytics
When I worked at a fast-food restaurant in high school, a co-worker friend and I decided its motto should be "Speed, Not Perfection." We silkscreened t-shirts for the two of us with that phrase embedded in the corporate logo -- two smart-aleck teenagers gently sticking it to the man.
Nowadays, data management and analytics teams increasingly find themselves being asked to fulfill the speed part to enable real-time data analysis in their organizations. But they don't have the luxury of being able to get by with the same sort of occasional sloppiness that my friend and I did in slapping burgers together. And that puts them under a lot of pressure, because creating a real-time architecture and using it to run streaming data analytics applications is a complicated undertaking.
For starters, streaming analytics systems don't come in a box -- not even a large one. Setting them up is an artisanal process that requires prospective users to piece together various data processing technologies and analytics tools to meet their particular application needs. In addition, the technology options have increased significantly over the past few years, thanks largely to the emergence of multiple big data platforms that provide stream processing capabilities in different ways.
A plethora of streaming platforms
Spark Streaming, Flink, Storm, Samza, Pulsar, Druid, Kylin -- they're all open source technologies vying for a piece of the data streaming and real-time analytics action. Even Kafka, originally a messaging technology for feeding data from one system to another, now also functions as a stream processing platform in its own right. In addition to the open source tools, various IT vendors offer more traditional complex event processing systems, a category that began emerging in the late 1990s. Specialized databases -- in-memory ones, for example -- are also built to handle streaming data analytics.
On the analytics software side, broader use of machine learning algorithms is making it more feasible to build predictive models that can churn through large amounts of streaming data on things like financial transactions, equipment performance and internet clickstreams. But again, there is a multitude of technology choices to consider: tools from mainstream analytics vendors and machine learning specialists, cloud-based services and open source platforms.
As with building a big data architecture in general, the surfeit of software available to underpin a real-time analytics architecture can be a boon for users -- or mire them in a veritable boondoggle of a deployment. Finding the right technologies and combining them into an effective analytics framework is a perilous process; missteps can send a project careening off the intended path.
Streaming forward on real-time projects
That isn't stopping companies, particularly large ones with lots of data and ample IT resources, from giving it a go. In an ongoing survey conducted by SearchBusinessAnalytics publisher TechTarget Inc., 28.1% of the 7,000-plus IT, analytics and business professionals who had responded as of mid-January said their organizations were looking to invest in real-time analytics technology over the ensuing 12 months. In addition, 13.4% said they planned to buy stream processing software.
Why do it? The ability to pull useful information out of data streams in real time lets business operations act fast, and that clearly can be to their advantage. Predictive analytics applications that run against streaming data on consumers' web activity can drive website personalization programs and targeted online advertising and marketing campaigns. Fraud detection, predictive maintenance and satellite imaging are other applications that can benefit from streaming data analytics.
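To make the fraud detection case a little more concrete, here is a minimal Python sketch of the kind of check a streaming application might run as each event arrives. It is an illustration only, not how any of the platforms named above work: the function name, the 60-second window, the three-transaction limit and the sample events are all hypothetical, and a production system would run this logic inside a stream processing engine rather than a single Python process.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # hypothetical sliding-window length
MAX_TXNS_IN_WINDOW = 3   # hypothetical per-card limit within the window


def flag_bursts(events, window=WINDOW_SECONDS, limit=MAX_TXNS_IN_WINDOW):
    """Yield (timestamp, card_id) whenever a card exceeds the limit.

    `events` is an iterable of (timestamp_seconds, card_id) pairs,
    assumed to arrive in time order, one at a time, as a stream would.
    """
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for ts, card in events:
        timestamps = recent[card]
        timestamps.append(ts)
        # Evict timestamps that have aged out of the sliding window.
        while timestamps and timestamps[0] <= ts - window:
            timestamps.popleft()
        # Act immediately, while the insight is still useful.
        if len(timestamps) > limit:
            yield ts, card


# Hypothetical sample stream: card "A" makes four purchases within 30 seconds.
stream = [(0, "A"), (5, "B"), (10, "A"), (20, "A"), (30, "A"), (200, "A")]
alerts = list(flag_bursts(stream))
```

The point of the sketch is the shape of the work: each event is evaluated the moment it shows up, against a small amount of recent state, instead of being stored first and analyzed in a later batch job.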
In many cases, real time might be the only time to take advantage of what's in the data being collected. Streaming analytics tools point to "perishable insights" that need to be acted on quickly before the opportunity is lost, Forrester Research analyst Mike Gualtieri and then-colleague Rowan Curran wrote in a 2016 Forrester Wave report. And you can't get those kinds of insights simply by throwing data into a Hadoop cluster, as Darryl Smith, chief data platform architect at Dell EMC, said during a presentation on the data storage vendor's real-time streaming efforts at Strata + Hadoop World 2016 in New York.
Speed is indeed a wonderful thing. Just be sure your team has a well-thought-out plan before turning up the heat on a streaming analytics initiative. Otherwise, it might end up getting flame-grilled by disappointed business executives.