Monday, January 7, 2008

Election Math

Today's CNN headline screams out "Obama opens double-digit lead over Clinton." When you read the article, you find that the poll, of 341 Democrats, showed Obama over Clinton, 39% to 29%. This compares with a Saturday poll showing them in a dead heat, at 33-33 (http://www.presidentpolls2008.com/ is a nice polling site, because it shows the polls side by side and links to the details of the statistical error and the actual questions asked).

Did so many people change their minds in one day? If we ignore other polls, then the statistical evidence is not conclusive. Why? Because the margin of error in the 39-29 poll is 5%, meaning that Obama's percentage is likely somewhere between 34 and 44%. The Saturday poll, also with a 5% margin of error, indicates his percentage is between 28 and 38%. The overlap between these two ranges means the numbers might not have changed at all. Rather, the difference may be mere statistical error, which is an artifact of the sampling. By luck of the draw, the Sunday poll may have found more Obama supporters, even though no one changed their mind.
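To make the overlap concrete, here is a minimal Python sketch. It assumes a simple random sample and the usual 95% formula for a margin of error; the function names are my own for illustration, and only the 341 respondents and the 39%, 33%, and 5% figures come from the polls described above.

```python
import math

Z_95 = 1.96  # Normal multiplier for a 95% two-sided range


def margin_of_error(share, n, z=Z_95):
    """Sampling margin of error for a share from a simple random sample of size n."""
    return z * math.sqrt(share * (1 - share) / n)


def interval(share, moe):
    """Range of plausible values: the share plus or minus the margin of error."""
    return (share - moe, share + moe)


def overlaps(a, b):
    """True when two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Sunday poll: 341 respondents, Obama at 39%. This comes out a little over 5%.
computed_moe = margin_of_error(0.39, 341)

# Use the stated 5% margin for both polls, as in the text.
sunday = interval(0.39, 0.05)
saturday = interval(0.33, 0.05)

print(f"computed margin: {computed_moe:.3f}")
print(f"Sunday range:    {sunday[0]:.2f} to {sunday[1]:.2f}")
print(f"Saturday range:  {saturday[0]:.2f} to {saturday[1]:.2f}")
print("ranges overlap:", overlaps(sunday, saturday))
```

The computed margin for 341 respondents is slightly above the rounded 5% that the poll reports, which is why the stated figure is used for the two ranges.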

When comparing two polls taken independently (as above), the error in the difference between them is more than the stated error of either poll (5% above) but less than the sum of the stated errors of the two polls (5%+5%=10% above). We can compute the error of the difference between these two polls as 7%, which means that Obama's numbers, which appeared to go up by 6% (from 33% to 39%), may have gone down by as much as 1% or up by as much as 13%.

Some math: The 7% is computed as 1.96 multiplied by the square root of the sum of the squared standard deviations of the two polls. The standard deviation of each poll is its error rate (5%) divided by 1.96, or about 2.5%. In a standard probability distribution called a Normal Distribution, 95% of the data falls within 1.96 standard deviations of the mean. Thus, in the latest poll, the mean for Obama was 39%, with a standard deviation of about 2.5%.
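The calculation just described can be sketched in a few lines of Python. This assumes the two polls are independent and that each has the stated 5% margin at the 95% level; the function name is mine, not anything standard.

```python
import math

Z_95 = 1.96  # Normal multiplier for a 95% two-sided range


def error_of_difference(moe_a, moe_b, z=Z_95):
    """Margin of error for the difference between two independent poll results."""
    sd_a = moe_a / z  # standard deviation of the first poll
    sd_b = moe_b / z  # standard deviation of the second poll
    return z * math.sqrt(sd_a**2 + sd_b**2)


diff_moe = error_of_difference(0.05, 0.05)  # about 0.071, i.e. roughly 7%
change = 0.39 - 0.33                        # Obama's apparent gain
low, high = change - diff_moe, change + diff_moe

print(f"error of the difference: {diff_moe:.3f}")
print(f"plausible change: {low:+.3f} to {high:+.3f}")
```

Because the 1.96 cancels, the result is the same as the square root of the sum of the squared margins, which is why it lands between 5% and 10%.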

Several implicit assumptions are made in computing the error rate in these polls, primarily summarized as: 1) the Normal Distribution is appropriate, 2) the sample is a random sample of all who will vote in the Tuesday Democratic primary, and 3) the answers in this poll are reflective of the way the voters will actually vote come Tuesday.

Assumption Review
1) Normal Distribution. This one is easy. For a large, random sample and a multiple choice question (who will you vote for), this assumption is close to reality except when the number polled is very small or the percentages (as are those for, say, Kucinich) are close to 0% or 100%. For Clinton and Obama, there is no real issue here, since the sample size is moderately large and their percentages are in the 30-40% range (a neat demo that shows how close a distribution is to Normal, depending on the sample size and percentage, is at http://www.ruf.rice.edu/~lane/stat_sim/binom_demo.html).
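For a rough numerical check of this assumption, the sketch below compares the exact binomial distribution with its Normal approximation and reports the largest gap between the two cumulative curves. The 2% share is a made-up stand-in for a low-polling candidate, not a figure from any poll, and the function names are my own.

```python
import math


def binomial_cdf_values(n, p):
    """Cumulative probabilities P(X <= k) for k = 0..n, with X ~ Binomial(n, p)."""
    total = 0.0
    values = []
    for k in range(n + 1):
        total += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        values.append(total)
    return values


def normal_cdf(x, mean, sd):
    """Cumulative probability of a Normal distribution at x."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


def normal_gap(n, p):
    """Largest gap between the binomial CDF and its Normal approximation.

    The Normal curve is evaluated at k + 0.5 (a continuity correction).
    """
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    exact = binomial_cdf_values(n, p)
    return max(abs(exact[k] - normal_cdf(k + 0.5, mean, sd)) for k in range(n + 1))


gap_obama = normal_gap(341, 0.39)  # share in the 30-40% range
gap_small = normal_gap(341, 0.02)  # hypothetical share close to 0%

print(f"largest gap at 39%: {gap_obama:.4f}")
print(f"largest gap at 2%:  {gap_small:.4f}")
```

The gap at 39% should come out much smaller than the gap at 2%, which is the point made above about percentages close to 0% or 100%.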

2) Random Sample. This is more difficult. Suppose Clinton voters go to church on Sunday, followed by lunch, while the Obama voters are home-bodies. This problem can be called selection bias. If there is church-lunch/home-body selection bias, then, in the Sunday poll, a random dialing of phone numbers would have surfaced more Obama voters and would not have been a random sample of Tuesday's voters, as opposed to Saturday, where you might have gotten more equality. [There is generally a second underlying issue of refusals--people who refuse to be polled. If these voters are more likely to vote for Clinton, then Clinton's numbers will be under-stated, but this would be true for both polls that we are comparing.]

3) The difference between how people said they felt in today's poll versus how they will vote in Tuesday's election. I find this difference, which can be called measurement error, the most troublesome. Take, for example, the fact that, in the CNN Saturday poll, one-fourth of voters stated that they had not yet decided, another quarter were only leaning toward someone, and just half had definitely decided. This is, of course, only what people are saying, and often people do not want to admit indecision, so the true number of undecideds may be even higher. Still, if the undecideds vote even 60-40 in favor of Hillary, it would erase the 10% lead of Obama.

The 5% error rate (and the 7% error rate for the difference) do not take the above issues into account. They implicitly assume these issues will have no effect. Thus, the true error rate in election polls is likely far higher.

If we consider other polls, the Obama lead and the change seem clearer. In the seven polls published on the 6th of January, Obama had an average lead of about 2-3%. In the five polls published Friday and Saturday, Clinton led in all of them, by around 5 points. We can show that this change, from pre-Sunday to Sunday, is statistically significant. However, because of the selection bias and measurement error issues above, it may not be indicative of the outcome on Tuesday.

My personal guess? Obama by a good margin...but a lot can happen in a day.

1 comment:

Jim said...

Thanks for explaining the complex statistics. I think that Obama's bounce is very real. People really seem to believe that he can be elected President due to his success in the Iowa caucuses.