Many analytics professionals have high hopes for big data, but speakers at the Predictive Analytics World conference struck a decidedly cautious tone when discussing the concept as it relates to building predictive models.
"To me, big data is just a hot-flash term, but it's nothing new to us," said Gary Miner, senior statistician and data-mining consultant at StatSoft.
There is still disagreement about what the term big data actually means. The most common definitions cite high data volume, velocity and variety, but there is no agreed-upon threshold at which a data set qualifies as "big." Miner said some people think several terabytes of data qualify as big, while others say it takes hundreds of terabytes.
Either way, he feels the importance of big data has been overblown. He said it is possible to find telling correlations in fairly small data sets. For example, he noted that some medical breakthroughs have come out of trials involving fewer than 100 patients. That is because smaller, more refined data sets often make it easier to separate the trend from the noise.
Cheaper storage has led many in the analytics world to ponder the possibilities of analyzing whole data sets, but Miner said analysts typically get better results more quickly by working with randomized samples drawn from those data sets.
"If you're going to make sense of data, you need to sort through the noise, and you're going to end up with a smaller data set," Miner said.
Michael Berry, analytics director at TripAdvisor for Business, said the current interest in big data comes from a desire on the part of businesses to implement a single piece of technology that solves multiple problems. He said vendors have been glad to play into this desire, promising that their big data software will greatly simplify business analytics projects. But he said this drive for an easy, simple solution is mostly a fantasy.
"While it's never been true, it makes a good sales pitch," he said.
Instead of hoping that big data software will solve every analytics problem, Berry recommended working to improve predictive models. The variables that define a predictive model ultimately matter more than the amount of data fed into the model.
Adding more data may also simply increase the time it takes to reach new insights, Berry said. Patterns in a data set often reveal themselves quickly. If a pattern becomes apparent after analyzing 100 data points, there is no need to analyze 100,000 more: the pattern will still be there, and all you will have done is lengthen the project. Past that point, more data yields diminishing returns.
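One way to see the diminishing returns Berry described is to look at how slowly a simple estimate improves as the sample grows. The sketch below is an illustration only, not a method any of the speakers presented: it assumes a synthetic, normally distributed population and uses hypothetical helper names (`make_population`, `sample_mean_errors`, `standard_error`). It relies on the standard result that the standard error of a sample mean shrinks with the square root of the sample size, so 100 times more data buys only about a tenfold gain in precision.

```python
import random
import statistics


def make_population(size, seed=0):
    """Build a synthetic population (assumed: mean 50, standard deviation 10)."""
    rng = random.Random(seed)
    return [rng.gauss(50.0, 10.0) for _ in range(size)]


def standard_error(std_dev, n):
    """Standard error of a sample mean: std_dev / sqrt(n)."""
    return std_dev / n ** 0.5


def sample_mean_errors(population, sizes, seed=0):
    """Map each sample size to the absolute gap between the mean of a
    random sample (drawn without replacement) and the full-population mean."""
    rng = random.Random(seed)
    full_mean = statistics.fmean(population)
    errors = {}
    for n in sizes:
        sample = rng.sample(population, n)
        errors[n] = abs(statistics.fmean(sample) - full_mean)
    return errors


if __name__ == "__main__":
    population = make_population(100_000)
    for n, err in sample_mean_errors(population, [100, 1_000, 10_000]).items():
        print(f"n={n:>6}  observed error={err:.3f}  "
              f"expected standard error={standard_error(10.0, n):.3f}")
```

The observed error for any single random sample will bounce around, but the expected standard error column shows the trend: each tenfold increase in sample size improves precision by only a factor of about 3.2.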
But not everyone was quite so bearish about big data. Peter Amstutz, analytics strategist at advertising agency Carmichael Lynch, said it is important, when developing predictive models, to collect data containing as many variables as possible. Sometimes it may be possible to accumulate information on a broad set of variables from a single source of standardized records, but often an organization will need to collect large amounts of less structured data. This is where the idea of big data can be helpful.
Amstutz recently helped Subaru implement an uplift modeling project that allows the car manufacturer to target its ad buys more effectively. Amstutz said he is always looking for new data sources that might contain information on consumer attributes that are relevant to building the profile of a consumer who may be receptive to Subaru's advertising. By looking at a greater number of variables, the advertising agency can precisely pinpoint the type of consumer who is likely to buy a Subaru.
For Eric Feinberg, senior director of mobile, media and entertainment at analytics vendor ForeSee, the quality of the data matters more than the amount. He said large volumes of data are generally only helpful if they are standardized and accurate.
He added that the benefits of big data analytics vary greatly by industry. In sales trend analysis, outliers that surface only when full data sets are examined may just add noise to the model, making it harder to find the true trend. But Feinberg pointed out that outliers are exactly what analysts are looking for in fraud detection. So sales forecasting may work fine with small samples, while fraud prevention efforts can benefit from big data analytics.
On the other hand, more traditional methods may work even better. Feinberg used the example of a medical device company that wants to build a better profile of its cardiologist customers. It could gather a large data set to find characteristics of likely buyers. Or it could simply pay cardiologists to participate in a focus group.
"That, in many cases, does the same thing," Feinberg said. "It's harder, it takes more time, but the outcome is a mature data set."