
Don't let these big data myths derail your analytics project

A number of myths about big data have proliferated in recent years. Don't let these common misperceptions kill your analytics project.

As the idea of big data analytics becomes more popular, more vendors are touting its benefits and executives are looking to capitalize on its potential. But along the way, a number of big data myths have developed that, if unchecked, could limit the insights businesses can derive from their data.

One of the most pervasive myths in this era of big data is that more data is always better. But speaking at the Big Data Innovation Summit held in Boston in September, Anthony Scriffignano, senior vice president of data and insights at information provider Dun & Bradstreet Inc., said that's not always the case.

There's always a lot of noise in data that can make the signal hard to hear, and when you indiscriminately collect more data, the ratio of noise to signal goes up. Scriffignano said it's important to understand the context of the data you're collecting and to have a specific use in mind before you decide to store it away.

"Our collective response so far has been to build bigger hard drives," he said, although that approach has limits. "But our response to the problem has not kept up. We have to know what's really happening so we know what data to throw away and what's important."

This is a myth that won't go away quietly. With storage costs continuing to plummet, the cost of saving data has never been lower. That has contributed to the popularization of the data lake concept, in which businesses stash away troves of data without really knowing what it is or how they'll use it. But while the direct expense of adding storage space may be minimal, the total cost adds up when superfluous data makes analytics work more difficult.
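As a rough illustration of Scriffignano's point about noise, the sketch below (not from the article) trains a simple classifier on a synthetic data set and then pads it with columns of pure noise, the kind of data hoarded "just in case." The data set, model choice and column counts are all illustrative assumptions; the only takeaway is that indiscriminately added data can dilute the signal rather than strengthen it.

```python
# Minimal sketch: adding columns with no predictive value can hurt, not help.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def score_with_noise(n_noise_cols):
    # Append columns of random noise to simulate data collected without a use in mind.
    noise_train = rng.normal(size=(X_train.shape[0], n_noise_cols))
    noise_test = rng.normal(size=(X_test.shape[0], n_noise_cols))
    model = LogisticRegression(max_iter=1000)
    model.fit(np.hstack([X_train, noise_train]), y_train)
    preds = model.predict(np.hstack([X_test, noise_test]))
    return accuracy_score(y_test, preds)

for n in (0, 50, 500):
    print(f"{n:4d} noise columns -> accuracy {score_with_noise(n):.3f}")
```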

More tools not always the answer

Another common myth is that simply throwing more tools at the problem will solve it. This is a message that has been perpetuated as the number of big data tools proliferates and vendors compete for slices of the market. But the tools are only as good as the people who use them.

"Very often, as data scientists, we're expected to bring wizardry to whatever we're doing, but I've discovered that a lot of times it works better with a little bit of elbow grease," said Abe Gong, senior scientist at Jawbone, a maker of fitness trackers and other electronic devices.


For example, Gong was getting ready to analyze a large data set that he knew contained a lot of incomplete data and duplicate entries. He first thought about writing an algorithm to clean up the data set but knew it would still leave some bad data behind. After thinking about the problem, he decided instead to ask everyone in the IT department to spend a couple of minutes manually cleaning the data set. Pretty soon it was ready to go: No one person had to invest much time, and the data was cleaner than it would have been if an algorithm had done the job.
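The sketch below is a hypothetical take on that "elbow grease" approach: let a script handle the unambiguous cleanup, then split what's left into small chunks for colleagues to review by hand. The file name, column names and reviewer list are assumptions for illustration, not details from Gong's project.

```python
# Hypothetical sketch: dedupe algorithmically, then farm out the ambiguous rows
# so no single person spends more than a few minutes on manual review.
import pandas as pd

df = pd.read_csv("entries.csv")                      # raw data with duplicates and gaps
df = df.drop_duplicates(subset="record_id")          # the easy, algorithmic part

# Rows the script can't confidently fix go out for a quick human pass.
needs_review = df[df["name"].isna()]
reviewers = ["alice", "bob", "carol", "dave"]
chunk_size = -(-len(needs_review) // len(reviewers))  # ceiling division

for i, reviewer in enumerate(reviewers):
    chunk = needs_review.iloc[i * chunk_size:(i + 1) * chunk_size]
    chunk.to_csv(f"review_{reviewer}.csv", index=False)
```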

Perhaps the most pernicious big data myth is that truth is static and wholly objective. In fact, it can be constantly moving.

This was borne out by John Hogue, a data scientist at General Mills Inc., who talked about how his team has worked to develop an internal dashboard that correlates various advertising campaigns with things like coupon use, positive social media posts and sales data.

Users of such systems, Hogue said, must remember that the data they have represents just a brief snapshot in time. In a rapidly changing world, inferring objectivity from this kind of data would be a mistake.

"You have to take a snapshot of what the truth is today," Hogue said. "That's going to be a challenge for a lot of business users."

Beware of mushrooming data quality issues

Misplaced trust in data objectivity can play out in other ways. It's particularly important to remember that analytical models won't always deliver accurate results when one model is used to feed another. And basing decisions on low-quality data may not be any more effective than going with an executive's gut instinct or even simply choosing at random.

Scott Hallworth, chief model risk officer at financial services company Capital One, said he has seen cases where one team builds a model for its own purposes, but then another team sees the new data that's available and incorporates it into its own models, not knowing that the first model only produced scores at a particular confidence level. With each new model, the quality of the data output that's produced can go down.
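A back-of-the-envelope sketch makes the degradation concrete. This is not Capital One's methodology, and it assumes, simplistically, that each model's errors are independent; it only shows why reliability can fall with every hop in a chain of models.

```python
# Rough sketch: if each model in a chain is right 90% of the time on its inputs,
# the effective confidence of the final output decays with every additional model.
per_model_confidence = 0.90

for n_models in range(1, 6):
    chained = per_model_confidence ** n_models
    print(f"{n_models} model(s) chained -> effective confidence ~{chained:.2f}")
```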

To deal with this problem, Hallworth recommended building governance mechanisms into models from the beginning and making sure that anyone who ends up using the output of a model knows how reliable the data is.

"A lot of people forget that when you build a model or report, you are generating data," Hallworth said. "Someone is going to use it and transform it into something else. That's what causes a lot of problems."

Ed Burns is site editor of SearchBusinessAnalytics. Email him at eburns@techtarget.com and follow him on Twitter: @EdBurnsTT.

Next Steps

Big data myths are common when it comes to data management

These big data myths need to be busted -- now

Hadoop is often the subject of misunderstanding in big data

This was last published in October 2014
