BOSTON -- When doing predictive analysis for business decision-making, the output of a model is only ever as good as the data that goes into it.
This may sound obvious, but in this era of big data, where massive data sets are available to analytics users throughout the enterprise and self-service analytics tools are proliferating, there can be a propensity to build predictive models around any data at hand. The need to understand what's in a data set and where it came from, as well as to ensure high data quality standards, has never been greater.
"There's just an extraordinary amount of information being created, and it's just too much for a human to process," said David Ledbetter, a data scientist at Children's Hospital Los Angeles. "How do you determine what's important and direct doctors to specific patients and recommend treatments?"
Ledbetter spoke in a panel discussion at this week's AI World 2017 Conference & Expo here. He said patients in the hospital's pediatric intensive care unit can have up to 500 sensors generating data about them every second. Finding a kernel of information in the sea of data that will have predictive value can be challenging.
Get to know your data creators
One of the ways his team deals with this difficulty is by working closely with the clinical staffers who work with the monitors that create the data. Of course, the data science team could develop a deep learning algorithm that crunches through all the data, zeroes in on the most predictive data features and then builds a predictive score around them. But that takes time and compute power. In this case, it's easier to lean on the clinical knowledge of the medical staff to identify important signals in the data.
"Every time we have a conversation with the clinical team we learn new and interesting things," Ledbetter said. "There are numbers in our data sets that you can't even begin to understand the meaning of until you talk with the clinical team."
Of course, none of this is specific to healthcare. Getting access to the right data at the right time is critical for any project aimed at using predictive analysis for business decision-making.
At the conference, Marc Hammons, a principal software architect at Dell, said his team maintains models aimed at predicting failures in the hardware that the computer company sells, enabling predictive maintenance. His team has access to data from many sources, primarily logs that describe the state and usage of hardware.
Data volume needs change by project
Many data science projects benefit from large volumes of data, but Hammons said the one at Dell doesn't require true big data. Instead, it works fine by applying traditional statistical methods to smaller data sets.
"We're finding that a lot of the problems we need to solve we can solve with just a few gigabytes of data," Hammons said. It's not that more data hurts the models, he added. It's just that the team can get the predictive accuracy it needs from a few features. And if predictive analysis for business can deliver acceptable results with less data, there's no reason to look at more.
Data science teams in all industries face an influx of data from a growing number of sources, which is forcing them to consider how much data and what types to include in their models. Consumer tracking firm The Nielsen Company takes a somewhat different tack than Dell -- for its purposes, more data often equals better results.
The company, which sells curated consumer data sets and software for developing data-driven marketing initiatives, integrates third-party data into its own data sets, creating a broad collection of data. At the conference, Jean-Pierre Abello, Nielsen's senior director of global engineering, said this helps offset potential biases in data that can torpedo predictive modeling projects by skewing analyses.
For example, he said Nielsen partners with other data providers, including Facebook and Experian. Each data source tends to overrepresent certain segments of the population, but by combining various sources with its own data, Nielsen is able to mitigate these biases, according to Abello. For this kind of customer-focused application of predictive analysis for business, trends toward increased data creation and the growing availability of diverse data sets can be a good thing.
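The idea that combining sources can offset each one's skew can be illustrated with a small sketch. This is not Nielsen's method, which the panel did not describe. The age segments, population shares and source counts below are all made up for illustration, and the helper names `segment_shares` and `max_skew` are hypothetical.

```python
from collections import Counter


def segment_shares(records):
    """Return each segment's share of the records (shares sum to 1)."""
    counts = Counter(records)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}


def max_skew(shares, population):
    """Largest absolute gap between a source's share and the population share."""
    return max(abs(shares.get(seg, 0.0) - p) for seg, p in population.items())


# Hypothetical population shares by age segment (assumed, not real figures).
population = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Two hypothetical sources, each overrepresenting a different segment.
source_a = ["18-34"] * 50 + ["35-54"] * 30 + ["55+"] * 20  # skews young
source_b = ["18-34"] * 15 + ["35-54"] * 35 + ["55+"] * 50  # skews older

# Simple pooling: concatenate the records from both sources.
pooled = source_a + source_b

skew_a = max_skew(segment_shares(source_a), population)
skew_b = max_skew(segment_shares(source_b), population)
skew_pooled = max_skew(segment_shares(pooled), population)

print(f"source A skew: {skew_a:.3f}")
print(f"source B skew: {skew_b:.3f}")
print(f"pooled skew:   {skew_pooled:.3f}")
```

In this toy case the pooled set sits closer to the assumed population than either source alone, because the two skews run in opposite directions. Pooling sources that lean the same way would not help, and in practice a team would likely need reweighting or another correction on top of simple pooling.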
"The world is becoming increasingly connected," Abello said. "We've seen the rise of [the internet of things]. It's happening on the consumer side. That means that now a whole category of consumer devices can collect data. Now we can more accurately predict the things [consumers] are looking for."