Three predictive modeling flaws that cripple data science projects

Data science can be incredibly valuable if done right, but just as damaging if done wrong. Here, a data science expert discusses three common predictive modeling pitfalls.

Data science projects can provide immense business value by guiding companies toward increased revenue or improved operations, but can also be damaging if done wrong.

Say someone on a marketing team who fancies himself a citizen data scientist uses Domo, Google Analytics or another user-friendly business intelligence tool to get insight into sales. The data shows the team will hit its numbers, and the team presents that information to top executives.

However, that data may not be accurate, as you can't see the methodology behind black box tools, said Ian Swanson, founder and CEO of DataScience, a data science services and platform provider in Culver City, Calif.

Building accurate predictive models requires customization and knowledge of various methodologies and approaches, which are applied depending on the scenario and data set. In other words, if you plug data into a black box predictive analytics tool, and trust that it has applied the right methodology and approach to analyze your data, you're taking a gamble. The best data science teams would never do that -- and neither should you.

"It might be cool to pull some insights out of [black box tools], but it can be truly dangerous," Swanson said. "Should a citizen data scientist be making decisions that impact the company? Not one big company wants [that]."

Swanson discussed three common pitfalls to beware of during data science projects: poor data quality, a lack of data lineage and analytics teams whose workflows aren't integrated with the business.

Poor data quality

The success of a data science project begins with good data. If the data going into a predictive model is bad, the predictive outputs won't be accurate.

With that, a critical first step in predictive modeling is exploring and evaluating data quality, determining how much data scrubbing is required and wrangling data into a usable format, Swanson said.
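As a rough illustration of that first exploratory pass, the sketch below counts missing values and exact duplicate records before any modeling starts. It is only a minimal example in plain Python: the `profile_quality` function, the field names and the sample sales records are all hypothetical and do not come from Swanson or DataScience.

```python
from collections import Counter


def profile_quality(rows, required_fields):
    """Summarize missing values and exact duplicates in a list of dict records."""
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        # Treat None and empty strings as missing values.
        for name in required_fields:
            value = row.get(name)
            if value is None or value == "":
                missing[name] += 1
        # A record is a duplicate if all required fields match an earlier record.
        key = tuple(row.get(name) for name in required_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {
        "rows": len(rows),
        "missing_by_field": {name: missing[name] for name in required_fields},
        "duplicate_rows": duplicates,
    }


# Hypothetical sales records, invented for illustration only.
sample = [
    {"region": "west", "revenue": 1200},
    {"region": "", "revenue": 900},
    {"region": "east", "revenue": None},
    {"region": "west", "revenue": 1200},
]

report = profile_quality(sample, ["region", "revenue"])
print(report)
```

A report like this only indicates how much scrubbing may be needed; it can't tell a team whether the right kind of data was collected in the first place.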

Data science teams also need to check that the right type of data is even there. Take the 2016 U.S. presidential election: Predictive models pointed to Hillary Clinton as the winner, and clearly those predictions were wrong. One reason, according to Swanson and other data science experts, was that there were critical voices missing in the data. In addition, the data fed into predictive models couldn't be validated; some voters may have said they planned to vote one way, but ultimately voted differently.

Lack of data lineage

A data science team needs to be able to follow the lifecycle of the data it uses, including where the data originated and how it was collected. The team also needs to be able to explain what it found during the data exploration phase, the analytics methodology and process, and how the company's business teams will be able to use the information.
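One lightweight way to keep that trail, sketched here under assumptions, is to store a small lineage record alongside each data set. The `LineageRecord` class, its fields and the example values are hypothetical; real teams may rely on a data catalog or pipeline tooling instead.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LineageRecord:
    """Hypothetical record of where a data set came from and what was done to it."""
    dataset: str
    origin: str
    collection_method: str
    collected_on: date
    steps: list = field(default_factory=list)

    def add_step(self, description):
        # Append one processing or analysis step, in the order it happened.
        self.steps.append(description)

    def summary(self):
        # Build a short, readable trail that can be shown to stakeholders.
        header = (
            f"{self.dataset}: from {self.origin} via "
            f"{self.collection_method} on {self.collected_on.isoformat()}"
        )
        numbered = [f"  {i}. {step}" for i, step in enumerate(self.steps, start=1)]
        return "\n".join([header] + numbered)


# Example values are invented for illustration only.
record = LineageRecord(
    dataset="q3_sales",
    origin="CRM export",
    collection_method="nightly batch job",
    collected_on=date(2017, 1, 15),
)
record.add_step("Removed records with missing revenue")
record.add_step("Fit regression model on cleaned records")
print(record.summary())
```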

Without clear data lineage, executives may not trust the data, and instead may choose to lean more on their intuition than data analysis, Swanson said.

"We connect the dots, so when executive stakeholders see the data, we show them how the data and the results were found and how [they] can be used in products," he said. "All the dots need to be pulled together."

Ivory tower analytics teams

Oftentimes, data science teams are centralized, and they don't have integrated workflows with business teams. Swanson recommends that data scientists be embedded in business teams to ensure they understand the problems that need to be solved and to work together to figure out how predictive analytics output can be productized (if that's the goal). Team integration also helps data science and business teams to identify analytical opportunities and to build upon institutional knowledge.

"We see data integrity challenges and [problems in] choosing the right algorithm, but the most important is the workflow -- if we solve this problem in this way, can it be used?" Swanson said. "Not having the business stakeholder at the table is a common pitfall. If you are trying to solve a problem for marketing, are the marketing people at the table with the data scientists?"

It is particularly important for data science teams to work with the engineering team that harmonizes the predictive models and puts them into production, he added. If those two teams don't use the same language, data science projects could be dead on arrival.
