In recent years, big data analytics has become almost synonymous with predictive analytics. As a result, there's a growing presumption that any system used for predictive analytics must involve big data, and that any big data system will surely support predictive modeling.
In reality, though, the two things aren't one and the same. And while access to massive data volumes and new types of data can significantly enhance the ability to develop good predictive models, analytics managers and their teams need to consider the fundamental aspects of what makes data big and how the challenges of managing it affect predictive analytics in big data environments.
First, let's examine the predictive analytics process itself. The popular perception of predictive analytics involves some sort of statistical analysis or pattern matching that is integrated into a business application to automatically drive operational decisions and actions. But implementing predictive models requires a number of steps, including the following:
- Data preparation to cleanse, transform and reorganize data into a format suitable for predictive analytics or machine learning algorithms. This involves profiling the data, looking for anomalies, determining what types of data quality standards to apply and what corrections to make, devising a data model suitable for analysis, and performing the transformations needed to make data sets consistent.
- Predictive model development, in which a training data set is created and subjected to selected algorithms, resulting in some number of analytical models to be tested. This step requires a plan for splitting the data that's being analyzed into various subsets, including the training set and one or more test sets.
- Testing, in which the various models are run against the test data sets and their performance is measured and evaluated to determine which model produces the best results.
- Integration and implementation, in which the most accurate model is incorporated into a production business process and run for real to generate analytical findings and recommend actions.
- Tweaking of the chosen predictive model to ensure its continued validity and accurate performance, with corresponding updates based on repeated analyses.
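To make the development, testing and selection steps above concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than a prescribed method: the toy data, the two candidate models (a mean baseline and a one-variable least-squares line), the 75/25 split and the choice of mean absolute error as the yardstick. A real project would normally rely on an analytics library and far richer models.

```python
import random
from statistics import mean


def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_mean_model(train):
    """Candidate 1 (baseline): always predict the mean training target."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_linear_model(train):
    """Candidate 2: one-variable least-squares line."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    slope = sxy / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mean_absolute_error(model, test):
    """Average absolute gap between predictions and actual test targets."""
    return mean(abs(model(x) - y) for x, y in test)


def select_best_model(records, candidates, seed=0):
    """Train every candidate on the training set and score it on the test set."""
    train, test = split_train_test(records, seed=seed)
    scores = {
        name: mean_absolute_error(fit(train), test)
        for name, fit in candidates.items()
    }
    best = min(scores, key=scores.get)  # lowest error wins
    return best, scores


# Toy (x, y) records that follow y = 3x + 2 exactly (made-up data).
data = [(x, 3 * x + 2) for x in range(40)]
best, scores = select_best_model(
    data, {"mean": fit_mean_model, "linear": fit_linear_model}
)
print(best, scores)
```

The model that wins here would be the one carried forward into integration, and rerunning the same comparison on fresh data is one simple way to approach the tweaking step.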
Big data creates unique challenges
Next, let's look at things in the context of the famous 3Vs of big data -- volume, variety and velocity -- and contemplate some specific challenges that must be addressed to effectively implement predictive analytics in big data environments.
Data volume. Aside from the obvious considerations related to managing often massive data volumes -- ingestion, staging and preventing data latency -- you must have streamlined processes to support the different stages of the analytics process. For example, you need to be able to extract a training data set that can be rapidly analyzed using the different candidate algorithms, but also one that adequately reflects the full set of data.
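One way to pull a manageable training set out of a data set that is too large to hold in memory is reservoir sampling, which gives every record the same chance of being selected in a single pass. The sketch below uses the classic form of the technique; the function name and parameters are hypothetical, and a uniform sample is only one interpretation of "adequately reflects the full set." Rare but important categories may call for stratified sampling instead.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept only if it lands
            # inside the reservoir, which happens with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


# Example: draw 10 records from a stream of one million without storing them all.
training_ids = reservoir_sample(range(1_000_000), 10, seed=42)
print(training_ids)
```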
Data variety. Businesses are increasingly presented with a wide variety of data inputs, ranging from conventional structured data to a growing number of unstructured data types. And, as more unstructured data streams become integral to business processes -- for example, continuous monitoring of Twitter streams to identify customer sentiment -- they're becoming necessary data sources for predictive models. That means you must have a set of robust processes for scanning, parsing and contextualizing unstructured data to transform it into data sets that can serve as fodder for analytics algorithms.
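As a rough illustration of what "parsing and contextualizing" can mean in practice, the sketch below turns a short social media post into a small row of numeric features. The word lists, feature names and sample message are all invented for the example; production sentiment analysis would typically use a trained language model or a curated lexicon rather than a handful of keywords.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny sentiment word lists.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"broken", "slow", "refund"}

TOKEN_RE = re.compile(r"[a-z']+")


def text_to_features(text):
    """Convert free text into a flat dict that an analytics algorithm can consume."""
    tokens = TOKEN_RE.findall(text.lower())
    counts = Counter(tokens)
    return {
        "n_tokens": len(tokens),
        "n_positive": sum(counts[w] for w in POSITIVE),
        "n_negative": sum(counts[w] for w in NEGATIVE),
        "has_mention": "@" in text,
    }


features = text_to_features("Love the new app, so FAST! @acme")
print(features)
# {'n_tokens': 7, 'n_positive': 2, 'n_negative': 0, 'has_mention': True}
```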
Data velocity. The complexity of dealing with large volumes of varied data is compounded by the accelerating speeds at which those data streams are delivered. Not only must you be able to deal with ever-faster feeds of incoming data, but there is often no way to predict when the structure or format of those feeds might change, which creates an almost continual need for data profiling and preparation.
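A lightweight way to catch unannounced format changes is to compare each incoming record against the structure you expect before it reaches the predictive models. The sketch below assumes records arrive as Python dictionaries and describes a schema as a mapping of field names to type names; both conventions, along with the function names, are assumptions made for illustration.

```python
def schema_of(record):
    """Describe a record as {field name: type name}."""
    return {field: type(value).__name__ for field, value in record.items()}


def detect_drift(expected, record):
    """Report fields that are missing, unexpected or changed in type."""
    observed = schema_of(record)
    shared = expected.keys() & observed.keys()
    return {
        "missing": sorted(expected.keys() - observed.keys()),
        "unexpected": sorted(observed.keys() - expected.keys()),
        "type_changed": sorted(f for f in shared if expected[f] != observed[f]),
    }


expected_schema = {"id": "int", "amount": "float"}
incoming = {"id": "7", "currency": "USD"}  # made-up record from a changed feed
print(detect_drift(expected_schema, incoming))
# {'missing': ['amount'], 'unexpected': ['currency'], 'type_changed': ['id']}
```

A report with any non-empty list could trigger re-profiling of the feed instead of letting malformed records flow silently into the models.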
Be smart about your analytics strategy
Design your strategy for predictive analytics in big data systems to address these challenges so you can successfully manage -- or finesse -- the critical points in the process.
For example, consider the challenge of boiling down a massive data set into a training set of reasonable size. In some cases, the best approach would be to use filters to reduce the data set's size, perhaps eliminating records that aren't part of common use cases, before randomly selecting the training set. In other cases, the better option might be to ramp up the big data system's compute resources so the analytics algorithms can handle a much larger training set -- and eliminate the need to filter out any records.
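The first option, filtering and then sampling at random, can be sketched in a few lines. The record layout, the `keep` rule and the sample size below are hypothetical, and the sketch assumes the filtered records fit in memory.

```python
import random


def filter_then_sample(records, keep, sample_size, seed=None):
    """Drop records that fail the keep() rule, then sample without replacement."""
    filtered = [r for r in records if keep(r)]
    if sample_size >= len(filtered):
        return filtered  # not enough records left to need sampling
    return random.Random(seed).sample(filtered, sample_size)


# Made-up records: keep only the "common use case" channel before sampling.
records = [{"id": i, "channel": "web" if i % 3 else "fax"} for i in range(300)]
training_set = filter_then_sample(
    records, keep=lambda r: r["channel"] == "web", sample_size=50, seed=7
)
print(len(training_set))
```

The trade-off is that anything the filter removes can never appear in the training set, so the filter rule itself deserves the same scrutiny as the model.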
As another example, addressing data velocity challenges might mean scaling up the system's streaming data ingestion capabilities so that each data feed can be run through the predictive models in full, or reducing the complexity of the models so that they can execute faster.
Each of these choices involves some give and take when it comes to design, engineering, complexity and cost. A more precise set of predictive models might require more processing and storage resources, but the analytics benefits could outweigh the added costs. Alternatively, your organization might be able to get what it needs from predictive analytics in big data applications from less complex models that don't require processing reinforcements.
Predictive analytics must mesh with big data processing to produce the results that analytics managers -- and corporate executives -- are looking for. To make that happen, it's imperative to figure out how to balance the performance and management demands of big data with the opportunities afforded by predictive analytics.