

How predictive modeling and forecasting failed to pick election winner

Nearly all predictive modeling algorithms were way off in picking the winner of the presidential election. What went wrong can strike any predictive analytics project if data scientists and other analysts aren't careful.

Prior to the 2016 presidential election, nearly everyone -- from data science guru Nate Silver's FiveThirtyEight website to The New York Times -- was predicting a huge likelihood of a comfortable victory for Hillary Clinton. And then their models broke.

What went wrong for the forecasters was hardly a unique set of problems, and it can strike any predictive modeling and forecasting project if analytics teams go down the wrong path. It involved a mix of overconfidence, poor data quality and mistaking a statistical likelihood for an ordained certainty.

"Unfortunately, [forecasters] give these numbers to one decimal place, and it sounds like it's a scientific formula, but it's not," said Pradeep Mutalik, an associate research scientist at the Yale Center for Medical Informatics, who blogs about elections for Quanta Magazine. "It's the overselling of certainty, and they ended up with egg on their faces."

Predicting the unpredictable

The day prior to the election, The New York Times Upshot election forecast gave Clinton an 85% chance of victory. The Huffington Post's model gave Clinton a 98% chance of winning. The FiveThirtyEight forecast was among the most modest, giving Clinton a 71.4% edge.

These forecasts weren't wrong, per se. The FiveThirtyEight model essentially said Donald Trump won roughly three out of every 10 of its simulations. Even The Huffington Post's model, bullish as it was about a Clinton win, didn't completely discount the possibility of a Trump victory.
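A minimal sketch can make that reading of a probability concrete. It is not FiveThirtyEight's model: it simply takes the published 71.4% figure as given and counts how often the underdog comes out ahead across many simulated elections. The function name `simulate_upset_rate` and the seed are assumptions made for illustration.

```python
import random


def simulate_upset_rate(favorite_win_probability, n_simulations, seed=0):
    """Return the share of simulated elections that the favorite loses.

    Illustrative only: each simulated election is a single weighted coin
    flip, not a state-by-state forecast.
    """
    rng = random.Random(seed)
    upsets = sum(
        1 for _ in range(n_simulations)
        if rng.random() >= favorite_win_probability
    )
    return upsets / n_simulations


# 71.4% is the forecast figure quoted in the article.
upset_rate = simulate_upset_rate(0.714, 100_000)
print(f"Underdog wins in {upset_rate:.1%} of simulated elections")
```

An outcome that turns up in well over a quarter of the runs is an ordinary result, not a model failure, which is the point the forecasters' decimal places tended to obscure.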

And to be fair, Nate Silver tweeted just after 6 p.m. EST on Nov. 8, "This doesn't seem like an election in which one candidate had a 99% chance of winning," and frequently talked about the uncertainty surrounding polls and forecasts in the weeks ahead of the vote.

But that wasn't how a lot of forecasts were promoted by the prognosticators or interpreted by the public. By providing such fine-grained detail in their forecasts, modelers gave the public an impression of certainty.

People don't understand probabilities

"The problem with that is that it's a probability, and people don't understand probabilities," Mutalik said. "I think it was a problem of data presentation. It's very irresponsible to present data like this to a lay public. I think that probability should not have been used to score the race."

Mutalik added that forecasts like the Cook Political Report, which gave a qualitative scale based on which way certain states were leaning, rather than trying to quantify likely votes, did a better job of describing the uncertainty of the race.

One of the reasons forecasts missed the mark was an overreliance on poll data. Today's forecasters develop their models by aggregating as many polls as they can get their hands on. Every poll has a margin of error, but forecasters assume bringing together polls from different sources cancels out this error. The presumption is each poll will have different reasons for error, like oversampling one demographic group. As long as each poll doesn't have the same reason for error, the overall strength of the aggregated polls compensates for the weaknesses of individual polls.

But in this election, there may have been more error in the polls than was recognized at the time. There's been a lot of talk about shy Trump voters who found it socially unacceptable to admit even to pollsters who they supported, and this common cause of polling error could have pulled aggregations wide of the mark.
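The difference between independent and shared polling error can be illustrated with a small simulation. This is a sketch under stated assumptions, not any forecaster's actual method: each poll's miss is modeled as a shared component (the same in every poll, such as respondents who won't disclose their choice) plus an independent component, both drawn from normal distributions. The function name `poll_average_rms_error` and the 3-point and 2-point error sizes are hypothetical values chosen only for illustration.

```python
import random
import statistics


def poll_average_rms_error(n_polls, independent_sd, shared_sd,
                           n_trials=20_000, seed=0):
    """Root-mean-square error (in points) of the average of n_polls polls."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_trials):
        # The shared miss hits every poll in this trial identically.
        shared = rng.gauss(0.0, shared_sd)
        # Each poll adds its own independent miss on top of the shared one.
        polls = [shared + rng.gauss(0.0, independent_sd)
                 for _ in range(n_polls)]
        squared_errors.append(statistics.fmean(polls) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


single_poll = poll_average_rms_error(1, 3.0, 0.0)
independent_only = poll_average_rms_error(20, 3.0, 0.0)
with_shared_error = poll_average_rms_error(20, 3.0, 2.0)

print(f"One poll:                     {single_poll:.2f} points")
print(f"20 polls, independent errors: {independent_only:.2f} points")
print(f"20 polls, plus shared error:  {with_shared_error:.2f} points")
```

Under these assumptions, averaging 20 polls shrinks the independent error by a factor of about the square root of 20, but the shared component passes through the average untouched, so adding more polls cannot push the aggregate's error below it.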

Forecasters discount significant events

There's also the issue of enthusiasm. Michael Cohen, an adjunct professor in George Washington University's Graduate School of Political Management and CEO of Cohen Research Group, a public opinion and market research firm in Washington, D.C., said forecasters discounted the large crowds at Trump's rallies and the strong engagement the candidate garnered on Twitter.

These factors are harder than poll data to work into predictive modeling and forecasting, but, ultimately, they pointed to voters who were more willing to show up at the polls on Election Day than nominal Clinton supporters.

"When you're trying to understand what's going on in the country, or in your company, you don't just look at one piece of data," Cohen said. "The bottom line for me is that polling can't be the only data you look at."

Ultimately, the industry that's built up around predictive modeling and forecasting for elections may be due for a reckoning. James Taylor, CEO of consultancy Decision Management Solutions, said an election between two specific candidates is a one-time event that will generate its own circumstances. As a rule, one-time events can't be predicted well using historical data. "Basic statistics mean that one-off events can't be analyzed for accuracy," he said.

Assigning a single-number probability to a particular outcome can be more challenging than we've come to believe, and it may not be that helpful given the way average voters think. "It's human nature," Mutalik said. "Even when polls give the margin of error, people just take the expected outcome."

Join the conversation


How do you think predictive modelers could have better predicted the 2016 election?
The predictive models that were nearly dead on were the ones that used social media with sentiment analysis. In this day and age, text analytics is far superior to traditional survey techniques. SAS text analytics with sentiment analysis gets the most accurate results. Don't let anyone tell you that Watson can do it.
According to Nate Silver, the difference between the projected and actual national popular vote percentages wasn't that far off (I vaguely recall it was within the margin of error). The accuracy problem is that one would really need 50 state-based polls rather than a national one. The reliance on random phone polling is breaking down (caller ID, cell phone polling being legally mandated to be done by hand). But the biggest issue, in my opinion, is that the projections didn't take into account the transition from likely voter to not-likely voter and then correlate that with preferred candidate. In an election where both candidates' approval ratings are under 50%, this is an important stat. Aggregating so many candidate polls without a corresponding voter enthusiasm tracking number was a problem. I believe this was evidenced by HRC coming up with fewer electoral votes because of losses in the Rust Belt, which were due to turnout there. Trump got slightly fewer votes than Romney, but HRC was down around 10% from Obama ('08) and still significantly down from Obama in '12.
Or it could be that the PM estimates were spot on, and the election was stolen because the election servers were hacked. We don't know because no one has actually counted the paper ballots in any of the contested states. The polls have never all been this far off, and in many of the battleground states the exit polls did not match the results.

I'm not sure how such a magnitude of discrepancy between the pollsters' information and the actual election was even possible. It's obvious that the pollsters and the media were so biased against Trump before the election that they either intentionally distorted what they were hearing from the people polled or chose to take samples biased toward areas where the left was more represented. This was not any deficiency in the mathematical basis of the predictive models; this was intentional underlying leftist bias in the media as well as in the rabble that was contracted to place the polling calls... So obvious, only lefties in academia would be scratching their heads wondering where the mathematical Statistical Analysis failed... Dummies, do you know what GIGO is? Your polls were garbage in, garbage out.

Or, the election was stolen. The only way we can rule that out is to count the actual paper ballots. This is not a Dem/vs Rep issue, but goes to the very foundation of our democracy. Take off your obviously partisan blinders and consider. The future of our republic demands that this question be resolved.

Although the exit polling matched the voting tabs rather closely, there are several examples of where the exit polling cannot be correct. For example: In some cases Trump got 29% of the Hispanic vote. Two percent more than Romney. That simply did not happen. So COUNT THE VOTES.

Oh, I got it now! Thanks a bunch for finally making me understand! The election WAS STOLEN! The election was stolen! The ELECTION WAS STOLEN...!!!!!!!!! Let's all keep repeating louder and louder, the election WAS STOLEN, THE ELECTION WAS STOLEN, THE SERVERS WERE HACKED! Now truly it's all finally explained, thanks to you, Sir. Thank you so much for clarifying the mystery of the "impossible" result of this election. I can sleep in peace now. The Right Wing stole the election, but the pollsters were right and totally impartial, their statistical methodology was mathematically sound and its practical implementation doesn't require any review or research, mathematically or otherwise. We just need to put the political RIGHT in jail for stealing the election, and bar them from participating in any other election. In fact we may not need any more elections, we may just allow only Left Wingers in government, all, as you know, sparkling honest and squeaky clean. The obviously dishonest Right Wing will continue to pay the tax burden that the Left determines them to pay. This obviously will meet with your desires, I'm sure... Was a pleasure talking to you, Good Bye!
The model is only as good as the data. Clearly, in this case, the data source was biased, the features were selected based on that biased data, and hence the model outcome was biased too.
Did we forget Chaos Theory? Given the US elections, and let's not forget Brexit, how many more times before it's a given that, currently, Predictive Analytics is a misnomer, i.e., "a wrong or inaccurate name or designation"? To be fair, regarding elections, more individuals need to do their "own" homework, not purely react to general hearsay and what the media FORCE feed us as reality, which is clearly the EASY option. After all, if garbage in equals garbage out and predictive implies some degree of QUALITY, how sure are we about the value and integrity of the data used?
No mention of the questions selected by pollsters and the way in which they were phrased or presented. Not very hard to see how this could widely skew results. An old saying, "Figures don't lie, but liars figure!"