Guide to big data analytics tools, trends and best practices
A comprehensive collection of articles, videos and more, hand-picked by our editors
As the idea of big data analytics becomes more popular, more vendors are touting its benefits and executives are looking to capitalize on its potential. But along the way, a number of big data myths have developed that, if unchecked, could limit the insights businesses can derive from their data.
One of the most pervasive myths in this era of big data is that more data is always better. But speaking at the Big Data Innovation Summit held in Boston in September, Anthony Scriffignano, senior vice president of data and insights at information provider Dun & Bradstreet Inc., said that's not always the case.
There's always a lot of noise in data that can make hearing the signal difficult, and when you indiscriminately collect more data, the noise ratio goes up. Scriffignano said it's important to understand the context of the data you're collecting and to have a specific use in mind before you decide to store it away.
This is a myth that won't go away quietly. With storage costs continuing to plummet, the cost of saving data has never been lower. That has contributed to the popularization of the data lake concept, in which businesses stash away troves of data without really knowing what it is or how they'll use it. But while the direct expense of adding storage space may be minimal, the total cost can add up when superfluous data makes analytics work more difficult.
More tools not always the answer
Another common myth is that simply throwing more tools at the problem will solve it. This is a message that has been perpetuated as the number of big data tools proliferates and vendors compete for slices of the market. But the tools are only as good as the people who use them.
"Very often, as data scientists, we're expected to bring wizardry to whatever we're doing, but I've discovered that a lot of times it works better with a little bit of elbow grease," said Abe Gong, senior scientist at Jawbone, a maker of fitness trackers and other electronic devices.
For example, Gong was getting ready to do an analysis of a large data set that he knew would have a lot of incomplete data and duplicate entries. He first thought about writing an algorithm to clean up the data set but knew it would still leave some bad data. After thinking about the problem, he decided to just ask everyone in the IT department to spend a couple of minutes manually cleaning the data set. Pretty soon it was ready to go, no one person had to invest much time cleaning it, and it was cleansed better than it would have been if an algorithm had done it.
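The automated approach Gong considered can be sketched in a few lines: drop exact duplicates and throw out records missing required fields. This is a minimal illustration only; the field names below are hypothetical and not from Jawbone's actual data set.

```python
# Minimal sketch of automated cleanup: drop exact duplicates and
# records with missing required fields. Field names are hypothetical.

def clean(records, required_fields=("user_id", "timestamp", "steps")):
    """Return records with duplicates and incomplete entries removed."""
    seen = set()
    cleaned = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # skip exact duplicate of an earlier record
        if any(rec.get(f) in (None, "") for f in required_fields):
            continue  # skip records missing a required field
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"user_id": 1, "timestamp": "2014-09-01", "steps": 4200},
    {"user_id": 1, "timestamp": "2014-09-01", "steps": 4200},  # duplicate
    {"user_id": 2, "timestamp": "2014-09-01", "steps": None},  # incomplete
]
print(clean(raw))  # only the first record survives
```

As Gong's anecdote suggests, a rule-based pass like this catches the obvious problems but still leaves bad data that a human reviewer would spot.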
Perhaps the most pernicious big data myth is that truth is static and wholly objective. In fact, it can be a constantly moving target.
This was borne out by John Hogue, a data scientist at General Mills Inc., who talked about how his team has worked to develop an internal dashboard that correlates various advertising campaigns with things like coupon use, positive social media posts and sales data.
Users of such systems, Hogue said, must remember that the data they have represents just a brief snapshot in time. In a rapidly changing world, inferring objectivity from this kind of data would be a mistake.
"You have to take a snapshot of what the truth is today," Hogue said. "That's going to be a challenge for a lot of business users."
Beware of mushrooming data quality issues
Misplaced trust in data objectivity can play out in other ways. It's particularly important to remember that analytical models won't always deliver accurate results when one model is used to feed another. And basing decisions on low-quality data may not be any more effective than going with an executive's gut instinct or even simply choosing at random.
Scott Hallworth, chief model risk officer at financial services company Capital One, said he has seen cases where one team builds a model for its own purposes, but then another team sees the new data that's available and incorporates it into their own models, not knowing that the first model only produced scores at a particular confidence level. With each new model, the quality of the data output that's produced can go down.
To deal with this problem, Hallworth recommended building governance mechanisms into models from the beginning and making sure that anyone who ends up using the output of a model knows how reliable the data is.
"A lot of people forget that when you build a model or report, you are generating data," Hallworth said. "Someone is going to use it and transform it into something else. That's what causes a lot of problems."
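One lightweight way to build the kind of governance Hallworth recommends is to attach provenance metadata to every score a model emits, so downstream consumers can check reliability before reusing it. The structure and field names below are an illustrative sketch, not Capital One's actual mechanism.

```python
# Sketch of a simple governance mechanism: model output travels with
# metadata about which model produced it and at what confidence level.
# All names and thresholds here are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class ScoredOutput:
    value: float
    confidence: float   # confidence level the producing model guarantees
    source_model: str   # which model generated this value

score = ScoredOutput(value=0.87, confidence=0.90, source_model="churn_v2")

# A downstream team can now enforce a quality bar instead of
# unknowingly treating the score as ground truth.
MIN_CONFIDENCE = 0.95
usable = score.confidence >= MIN_CONFIDENCE
print(usable)  # False: this output shouldn't silently feed another model
```

The point is that the metadata, not the raw number, is what lets the next team make an informed decision about reuse.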