Predictive analytics projects can bolster business decisions

Predictive analysis for business requires the right data

More data doesn't always benefit predictive analytics projects. Data sources must be scrutinized and understood before being used in a predictive model, experts warn at the AI World conference.

BOSTON -- When doing predictive analysis for business decision-making, the output of models is only ever as good as the data that goes into them.

This may sound obvious, but in this era of big data, when massive data sets are available to analytics users throughout the enterprise and self-service analytics tools are proliferating, there can be a propensity to build predictive models around whatever data is at hand. The need to understand what's in a data set and where it came from, and to hold it to high data quality standards, has never been greater.

"There's just an extraordinary amount of information being created, and it's just too much for a human to process," said David Ledbetter, a data scientist at Children's Hospital Los Angeles. "How do you determine what's important and direct doctors to specific patients and recommend treatments?"

Ledbetter spoke in a panel discussion at this week's AI World 2017 Conference & Expo here. He said patients in the hospital's pediatric intensive care unit can have up to 500 sensors generating data about them every second. Finding a kernel of information in the sea of data that will have predictive value can be challenging.

Get to know your data creators

One of the ways his team deals with this difficulty is by working closely with the clinical staffers who work with the monitors that create the data. Of course, the data science team could develop a deep learning algorithm that crunches through all the data and zeros in on the most predictive data features and then build a predictive score around that. But that takes time and compute power. In this case, it's easier to lean on the clinical knowledge of the medical staff to identify important signals in the data.
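The trade-off Ledbetter describes -- letting an algorithm rank signals versus leaning on clinicians -- often starts with a cheap screening step. A minimal sketch of that idea, with invented sensor names and numbers (not CHLA data or methods), ranking candidate features by absolute correlation with an outcome:

```python
# Illustrative sketch only: sensor names, values and the outcome flag
# are invented. Ranks candidate features by absolute Pearson
# correlation with an outcome -- a cheap first-pass screen before
# spending compute (or clinicians' attention) on a full model.

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(features, outcome):
    """Feature names sorted by |correlation| with the outcome, strongest first."""
    scores = {name: abs(pearson(vals, outcome)) for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

features = {
    "heart_rate": [88, 92, 110, 75, 130, 95],
    "resp_rate":  [18, 20, 22, 16, 30, 19],
    "bed_number": [3, 1, 4, 2, 5, 6],   # clinically meaningless on purpose
}
outcome = [0, 0, 1, 0, 1, 0]            # e.g., a deterioration flag

print(rank_features(features, outcome))
# → ['heart_rate', 'resp_rate', 'bed_number']
```

A screen like this only surfaces candidates; as the clinicians' input above suggests, domain experts still have to say which correlations are meaningful and which, like the bed number here, are accidents of the data.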

"Every time we have a conversation with the clinical team we learn new and interesting things," Ledbetter said. "There are numbers in our data sets that you can't even begin to understand the meaning of until you talk with the clinical team."

Of course, none of this is specific to healthcare. Getting access to the right data at the right time is critical for any project aimed at using predictive analysis for business decision-making.

At the conference, Marc Hammons, a principal software architect at Dell, said his team maintains models aimed at predicting failures in the hardware that the computer company sells, enabling predictive maintenance. His team has access to data from many sources, primarily logs that describe the state and usage of hardware.

Data volume needs change by project

Many data science projects benefit from large volumes of data, but Hammons said the one at Dell doesn't require true big data. Instead, it works fine by applying traditional statistical methods to smaller data sets.

"We're finding that a lot of the problems we need to solve we can solve with just a few gigabytes of data," Hammons said. It's not that more data hurts the models, he added. It's just that the team can get the predictive accuracy it needs from a few features. And if predictive analysis for business can deliver acceptable results with less data, there's no reason to look at more.

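A minimal sketch of the "few features, traditional statistics" approach Hammons describes -- here a tiny logistic regression trained by plain gradient descent on invented log-derived features (operating hours and error counts). This is an illustration of the general technique, not Dell's actual model or data:

```python
# Sketch only: features, labels and hyperparameters are invented.
# A two-feature logistic regression fit by stochastic gradient
# descent -- the kind of small-data statistical model that can
# predict hardware failure from a handful of log features.
import math

def train_logreg(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by minimizing log loss with SGD."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))          # predicted failure probability
            err = p - yi                            # gradient of log loss wrt z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) >= 0.5 else 0

# Features: (operating hours / 1000, logged error count); label: failed?
X = [(1.0, 0), (2.0, 1), (8.0, 9), (9.0, 7), (3.0, 2), (7.0, 8)]
y = [0, 0, 1, 1, 0, 1]
w, b = train_logreg(X, y)
print([predict(w, b, xi) for xi in X])
```

With only two informative features and a few dozen lines of arithmetic, the model separates the failing units -- the point being that this fits comfortably in "a few gigabytes" territory, no big data stack required.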
Data science teams in all industries face an influx of data from a growing number of sources, which is forcing them to consider how much data and what types to include in their models. Consumer tracking firm The Nielsen Company takes a somewhat different tack than Dell -- for its purposes, more data often equals better results.

The company, which sells curated consumer data sets and software for developing data-driven marketing initiatives, integrates third-party data into its own data sets, creating a broad collection of data. At the conference, Jean-Pierre Abello, Nielsen's senior director of global engineering, said this helps cover up potential biases in data that can torpedo predictive modeling projects by skewing analyses.

For example, he said Nielsen partners with other data providers, including Facebook and Experian. Each data source tends to overrepresent certain segments of the population, but by combining various sources with its own data, Nielsen is able to mitigate these biases, according to Abello. For this kind of customer-focused predictive analysis for business application, trends toward increased data creation and the growing availability of diverse data sets can be a good thing.
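One standard way to offset the overrepresentation Abello describes is post-stratification: weight the pooled records so that segment shares match known population proportions. A rough sketch with invented segments and counts (not Nielsen's actual methodology):

```python
# Hypothetical sketch of post-stratification: two sources each
# overrepresent a different segment; weights make the pooled sample's
# segment shares match target population shares. All numbers invented.
from collections import Counter

def poststratify(records, population_share):
    """Attach a weight to each record so that weighted segment shares
    match the given population shares."""
    counts = Counter(r["segment"] for r in records)
    n = len(records)
    weights = {seg: population_share[seg] / (counts[seg] / n) for seg in counts}
    return [dict(r, weight=weights[r["segment"]]) for r in records]

source_a = [{"segment": "18-34"}] * 6 + [{"segment": "35+"}] * 2   # skews young
source_b = [{"segment": "18-34"}] * 1 + [{"segment": "35+"}] * 3   # skews older
pooled = poststratify(source_a + source_b, {"18-34": 0.5, "35+": 0.5})

weighted_young = sum(r["weight"] for r in pooled if r["segment"] == "18-34")
weighted_older = sum(r["weight"] for r in pooled if r["segment"] == "35+")
print(round(weighted_young, 2), round(weighted_older, 2))
# → 6.0 6.0  (equal weighted shares, matching the 50/50 target)
```

Underrepresented records get weights above 1 and overrepresented ones get weights below 1, so downstream analyses see the population mix rather than the collection bias of any one source.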

"The world is becoming increasingly connected," Abello said. "We've seen the rise of [the internet of things]. It's happening on the consumer side. That means that now a whole category of consumer devices can collect data. Now we can more accurately predict the things [consumers] are looking for."

Join the conversation

How do you find good data in your predictive analytics process?
Excellent to see this message getting out. I've already commented elsewhere that the IoT is giving many people an opportunity to relearn "garbage in, garbage out." Data semantics and quality are crucial, as is a good enough understanding of how the methods work, including how they can mislead.
Not always. Many times I don't. That's why I'm looking at technologies that don't depend on having all the data but still have the capability to predict. #augmentedintelligence


Data has become a religion, but data doesn't speak for itself. A collection of data may not adequately represent the phenomena one thinks it does. Typically, data is collected and stored according to the rules, logic and model of the application that records it. That may not be enough for further analysis; it may only represent the situation needed by that application. This goes beyond data errors; it's endemic. As a result, ML -- especially unattended ML -- is likely to be off the mark. We talked about data reduction at the edge, but what if those repeating "normal" messages that got squeezed out were significant when compared with simultaneous readings from other sensors?

Take the example of the 1.5 images of skin lesions used to train a dermatology agent to spot malignancies. It turned out to produce a high level of false positives. The reason? It was common practice for radiologists to measure the size of a lesion with a ruler when it appeared to be 3 cm or larger, so the ML algorithms learned to interpret the appearance of the ruler as a sign of malignancy. Unexpected bias. Then there's the well-known issue of Facebook surfacing only "white guy" news. We have a long way to go.

Given machine learning's apparent leaps in capability, it raises the question: why now and not before? These techniques have been around for decades, or longer:

  • Ordinary Least Squares Regression

  • Logistic Regression

  • Decision Trees

  • Support Vector Machines

  • Naive Bayes Classification

  • Clustering Algorithms

  • Principal Component Analysis / Independent Component Analysis

  • Ensemble Modeling (I read a paper on this 20 years ago).
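To illustrate how lightweight the first item on that list is: ordinary least squares for a single predictor reduces to a closed-form slope and intercept, no machinery required (toy numbers below):

```python
# Toy illustration of ordinary least squares regression with one
# predictor, using the closed-form slope and intercept. Data invented.
def ols_fit(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]        # roughly y = 2x
slope, intercept = ols_fit(xs, ys)
print(round(slope, 2), round(intercept, 2))
# → 1.99 0.09
```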

I’m going to suggest three reasons:

  1. The obvious one: the amount of data, and the power to process it, have reached a level where anyone with access to data and a computer can build models with these techniques at a presumed level of credibility (which, as I said above, is a dangerous assumption). A data scientist no longer has to take a tray of punch cards to the computer room and wait overnight for results. Instantaneous results are intoxicating.

  2. For a while, it was adequate to do some "predictive modeling," but the discipline was all in the knowledge and skill of the modeler. The idea of machines finding stuff for you became so alluring that ML/AI has gone from a meme to a mania, with thanks to the vendors, bloggers and analysts, most of whom have never touched this stuff.

  3. An exploding number of people with less-than-thorough skills have been taken up by data-science-hungry organizations, dubbed data scientists, and are producing suspect results.