The biggest winner of this year's presidential election wasn't the Democrats or the Republicans; instead, it was statisticians like Drew Linzer, assistant professor of political science at Emory University, who correctly forecast the outcome months ago. Linzer runs a website called Votamatic.
SearchBusinessAnalytics.com recently sat down with Linzer, whose paper on the same topic is soon to be published by the Journal of the American Statistical Association, to talk about his use of political data, how he addresses potential data quality issues, and why, despite such successful results, campaigns are still crucial.
Where does your political data come from?
Drew Linzer: There's a long history of research in political science about factors you can use to predict presidential outcomes ahead of time. … So there's a whole family of these sorts of models that different academics have come up with. Some of them have a better track record than others. The one I depended on was created by a colleague of mine, Alan Abramowitz. The model he's developed uses three factors: one is the growth rate of the GDP in the early part of the election year. That comes from the Bureau of Economic Analysis. That's all public. That's a government statistic that gets updated throughout the year. The second is the approval rating of the incumbent president. That's measured by Gallup in June -- again, public. The third factor is whether or not the president's party has been in power for one or more terms. It's sort of a measure of voter fatigue with the incumbent party. Obviously, that's public. Just using those three publicly available measures, you can get a pretty decent sense of the outcome.
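A model like the one Linzer describes boils down to a simple linear equation in the three factors. The sketch below is illustrative only: the coefficients are made up for demonstration, not Abramowitz's fitted values, which come from regressing historical election results on these inputs.

```python
# Illustrative three-factor linear forecast in the spirit of the model
# described above. All coefficients are hypothetical placeholders.

def forecast_vote_share(q2_gdp_growth, june_net_approval, first_term_incumbent):
    """Predict the incumbent party's share of the two-party vote (percent).

    q2_gdp_growth        -- GDP growth rate early in the election year
    june_net_approval    -- president's June net approval (approve - disapprove)
    first_term_incumbent -- True if the president's party has held office for
                            only one term (less voter fatigue)
    """
    intercept = 48.0      # hypothetical baseline vote share
    b_gdp = 0.5           # hypothetical points per point of GDP growth
    b_approval = 0.1      # hypothetical points per point of net approval
    b_first_term = 2.5    # hypothetical bonus for a first-term incumbent party

    return (intercept
            + b_gdp * q2_gdp_growth
            + b_approval * june_net_approval
            + b_first_term * (1 if first_term_incumbent else 0))

# Example: modest growth, slightly positive approval, first-term incumbent.
print(forecast_vote_share(1.5, 2.0, True))
```

The point is not the particular numbers but the structure: three public inputs, one linear combination, one forecast, all available months before Election Day.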
Winners of the 2012 presidential election
- Drew Linzer, assistant professor of political science at Emory University
- Nate Silver, the New York Times' FiveThirtyEight blog
- Sam Wang, the Princeton Election Consortium
- Simon Jackman, professor of political science at Stanford University
- Josh Putnam, visiting assistant professor of political science at Davidson College
You and Nate Silver have both talked about using polling data.
Linzer: That's the other part of it. There are a number of websites now -- polling aggregators -- that collect the results of publicly released opinion polls, conducted primarily by news organizations but also by private opinion research firms that release polls to publicize their own research capabilities. Websites like The Huffington Post, RealClearPolitics and Talking Points Memo aggregate these things and publish them. With The Huffington Post, in particular, you can go to the website and download the results of these polls. They've even developed APIs [application programming interfaces], and they completely opened up their historical archives this year, which is an incredible resource.
When does polling data start to play a role in forecasting?
Linzer: The polling data itself starts to become available over a year in advance of an election. The problem is, it's not really that informative at that stage. You wouldn't want to use those polls to predict the election outcome. … Over the last two months, the polls become more and more accurate at telling you who's going to win as you get closer to Election Day. And in the last couple of weeks, and certainly in the last week, the polls are going to be very informative about who is going to win.
But your forecast was made in June. So you didn't rely on the most accurate polls to make your prediction?
Linzer: That's sort of the trick of my approach: the idea is that there are these two sources of information -- historical information and polls. Early on, my model weights the historical factors more heavily. As the campaign goes on, the historical factors don't change, but the polls get more and more precise. So the forecast transitions from being based on the historical factors to being based more on the polls. That's how Nate Silver's model works, too. He just weights them in a different way than I do.
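One standard way to picture this transition is precision weighting: treat the historical forecast as a prior and the polling average as data, and let whichever is more precise dominate. The normal-normal setup and all the numbers below are a simplified assumption for illustration, not Linzer's actual model.

```python
# Minimal normal-normal Bayesian updating sketch (illustrative only).
# The forecast is a precision-weighted average of a historical prior and
# the current poll average. As polls accumulate, poll variance shrinks
# and the forecast shifts from the prior toward the polls.

def combine(prior_mean, prior_var, poll_mean, poll_var):
    """Posterior mean and variance for a normal prior and normal poll estimate."""
    prior_precision = 1.0 / prior_var
    poll_precision = 1.0 / poll_var
    post_var = 1.0 / (prior_precision + poll_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + poll_precision * poll_mean)
    return post_mean, post_var

# June: few polls, high poll variance -> the forecast hugs the prior.
june, _ = combine(prior_mean=52.0, prior_var=4.0, poll_mean=49.0, poll_var=16.0)

# Late October: many polls, low variance -> the forecast hugs the polls.
october, _ = combine(prior_mean=52.0, prior_var=4.0, poll_mean=49.0, poll_var=0.25)

print(june, october)
```

With identical inputs, only the poll variance changes between the two calls, which is enough to move the forecast most of the way from the prior (52) toward the polls (49).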
A challenge for businesses is data quality.
Linzer: It terrifies me. That's a huge issue.
How did that impact your analysis?
Linzer: One of the major considerations people like me have to deal with is the possibility that the polls, for some methodological reason, are systematically flawed. We expect there to be random variation in the polls due to sampling; you can't do anything about that. … But there are other sources of error that are systematic. Sometimes we call these things 'house effects': certain pollsters tend to be more or less favorable to certain candidates just because of how they conduct the polls. What we do is we assume these house effects cancel out -- that, on average, there will be some that are a little bit more pro-Democratic and some that are a little bit more pro-Republican. That's turned out to be a pretty good assumption historically, but that's not necessarily the case. If the polls are consistently more Democratic or Republican, just due to, say, the difficulty of reaching people on cellphones who are going to be more Democratic or whatever it is, then garbage in, garbage out. That's a real concern. Some models add a bit more uncertainty to account for that possibility, but that's not something that's knowable until Election Day.
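The distinction Linzer draws can be shown with a toy simulation (my illustration, not his code): pollster-specific house effects centered on zero wash out when you average many polls, but a bias every pollster shares, such as missing cellphone-only voters, survives the averaging intact.

```python
# Illustrative simulation of random vs. systematic polling error.
import random

random.seed(42)
true_support = 51.0  # hypothetical true support for one candidate, in percent

def poll_average(n_polls, shared_bias):
    """Average many simulated polls with house effects and sampling noise."""
    readings = []
    for _ in range(n_polls):
        house_effect = random.gauss(0, 1.0)    # pollster-specific, mean zero
        sampling_noise = random.gauss(0, 2.0)  # random sampling error
        readings.append(true_support + shared_bias + house_effect + sampling_noise)
    return sum(readings) / len(readings)

# Zero-mean house effects cancel: the average lands near the truth.
unbiased = poll_average(500, shared_bias=0.0)

# A shared bias does not cancel: the average is off by roughly that bias.
biased = poll_average(500, shared_bias=-2.0)

print(unbiased, biased)
```

This is the "garbage in, garbage out" scenario in miniature: no amount of aggregation fixes an error that every input shares.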
Is there something about political data that makes it ripe for analysis? In other words, could you build as accurate a model for health care or finance data?
Linzer: If anything, political data is harder. When we talk about political phenomena and modeling social behavior in politics -- that's what people in my field of political science do -- it's unbelievably hard because of how unpredictable human behavior is, but also because of how little data there [is] to go on. If you're talking about health care or any of these other fields where you just have reams and reams of data -- you know, big data -- you can make a few assumptions about the data in order to get the patterns out. You still need to be using models, and models are just assumptions. The way I distinguish good data analysis from bad data analysis is the cleverness of the modeling approach.
What does that mean, 'the cleverness of the modeling approach'?
Linzer: You have all of this raw information and you believe there are patterns in it. So there's a systematic element in the data, but there's also a random element in the data. Statisticians call this noise, or we say the world is stochastic. The trick is getting the pattern out -- separating the pattern from the noise. In other words, extracting as much information as possible from the raw data, and not getting fooled by the part that's not systematic. Statistics is a very creative field because how you do that is really wide open.
If you can predict an election in June, why do candidates need to continue campaigning?
Linzer: Because campaigns do lots of other important things. First of all, it's not always going to be the case that the election is essentially predictable in June. This year was a little bit unusual in that respect. But even if it is predictable, you never know what's going to happen. … And the only reason the models work is because they assume the campaigns are going to happen. If one candidate runs a terrible campaign or doesn't raise enough money or whatever it is, they'll suffer because of that. These models work because, historically, both candidates compete on a roughly level playing field. That assumption is built into the forecast. If one doesn't show up, that's an unexpected thing. There are all sorts of normative reasons, like why it's good for democracy for these campaigns to happen, but also from a statistical standpoint, we assume they're going to campaign.