The fundamental objective of polling is to infer, from a small subset of a given population, the opinion of that population. Polling is an application of statistics, but in most cases the outcome is not empirically verifiable. The true value of public opinion can only be known by querying every member of the public, an expensive undertaking sometimes referred to as *Democracy*.

For election-based polls, pollsters seek to measure a potential outcome, over time, with hypothetical questions like “If the election were held today.” As time passes, “today” eventually becomes *Election Day*. Iowa and New Hampshire have had their *Election Days*, but before they voted, a total of 192 polls from 28 unique pollsters were conducted. The vast majority of these 192 polls contain results which can never be verified empirically; a small minority, however, can be directly compared to the election result they sought to measure.

Using our aggregated polling data and the known outcome, we should be able to assess which pollsters did a *better* job. An election’s outcome serves as the single reference point by which a pollster’s result can be assessed empirically; it is the only time at which public opinion is definitively known. This assessment, perhaps called *empirical accuracy*, is our focus.

One definition of *better* is the accuracy with which a pollster measured each demographic sub-sample; this derives from the idea of an expected demographic outcome. The procedure outlined below could be applied to any sample or combination of sub-samples. We’re going to focus on the gender sample, and we’ll call it the *gender expectation*.

Using any given poll, these are the steps to determine the *gender expectation*. The first two steps derive information from the actual outcome, generally from exit or entrance polling; the same breakdown must also be released by the pollster or the analysis cannot be done:

1. Determine the outcome within each sample group; in our case, the outcome among only males and among only females.
2. Determine the proportion of participation from each sample group; in our case, the number of males and the number of females who voted.

The following steps then require the inclusion of the poll’s data:

3. Using the information from #1 and #2, re-weight the data provided by the poll to deduce a new topline result. This step avoids punishing a pollster for weighting incorrectly.
4. Calculate the deviation of each sample: take the difference between each group’s result from #1 and that reported by the poll, then calculate the deviation of the re-weighted topline from #3 against the actual outcome. Summing all deviations yields the overall *score*; the lower, the better.
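The four steps above can be sketched in code. The following is a minimal sketch, not a definitive implementation; the function name, the data layout, and the use of candidate *A*’s vote share as the comparison metric are my own illustrative assumptions:

```python
def gender_expectation_score(actual, poll):
    """Score a poll against the actual result; lower is better.

    `actual` and `poll` map each group (e.g. "male", "female")
    to a (votes_for_A, votes_for_B) tuple.
    """
    def share_a(votes):
        a, b = votes
        return a / (a + b)

    # Steps #1 and #2: per-group outcome and participation, from the actual result.
    turnout = sum(a + b for a, b in actual.values())
    actual_topline = sum(a for a, _ in actual.values()) / turnout

    # Step #3: re-weight the poll's group results by actual participation.
    reweighted_topline = sum(
        share_a(poll[group]) * (sum(actual[group]) / turnout)
        for group in actual
    )

    # Step #4: sum each group's deviation plus the re-weighted topline deviation.
    score = abs(reweighted_topline - actual_topline)
    for group in actual:
        score += abs(share_a(poll[group]) - share_a(actual[group]))
    return score
```

Deviations here are fractions of candidate *A*’s vote share; multiply by 100 for the percentages used below.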

A carefully crafted example below illustrates the procedure. Our example assumes a fixed and consistent participation rate for #2: 4 males and 6 females are always included in each poll. This simplifies the examples, but nullifies the significance of #3, as the re-weighting doesn’t alter the outcome.

Let’s suppose a municipality exists with 120 residents, 100 of whom are eligible to vote. Breaking news! An election was just held with two candidates, *A* and *B*; 40 people voted. The result, along with an exit poll broken out by gender, is provided below:

Actual | Candidate A | Candidate B | Total |
---|---|---|---|
All | 16 | 24 | 40 |
Female | 0 | 24 | 24 |
Male | 16 | 0 | 16 |

The above data provides us with #1: all males voted for *A* and all females for *B*. We are also given #2, the number of participants in each group: 16 males and 24 females.

Four fictitious pollsters released polls the previous day; the raw topline margin of each mystery poll, measured as the deviation of its overall result from the actual outcome, is below:

Pollster | Topline Margin |
---|---|
1 | 0% |
2 | 0% |
3 | 10% |
4 | 10% |

To illustrate the lack of depth in the above ranking, we’re going to assess each poll’s accuracy using the *gender expectation* method. To reduce the number of variables, we’ll assume each pollster sampled 10 people and included the correct ratio of males to females.

The first was conducted by a *Bad* pollster:

Bad | Candidate A | Candidate B | Total | Deviation |
---|---|---|---|---|
All | 4 | 6 | 10 | 0% |
Female | 3 | 3 | 6 | 50% |
Male | 1 | 3 | 4 | 75% |

This *Bad* pollster matched the overall outcome, but on closer inspection of the gender crosstabs, they didn’t do so *good*. They were off by 50% in the female demographic, compared to the actual outcome, and off by 75% among males. They produced an accurate overall result, but for the wrong reasons: 6 individuals (60%) in their sample, 3 of each gender, did not align with their group’s actual outcome. The total deviation from the actual outcome was 125%: the sum of the two demographic deviations and the overall deviation. To reiterate, this sum includes the calculation in #3, but because the sample group sizes remain static, that calculation is still 0%, the same as the initial topline. This is a terrible poll that got lucky with its overall result.
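The arithmetic behind the 125% can be checked directly; a sketch, with variable names of my own choosing:

```python
# Actual result: females voted 0/24 for A, males 16/16; topline 16/40 for A.
# Bad poll: females 3/6 for A, males 1/4; topline 4/10 for A.
female_dev  = abs(3 / 6 - 0 / 24)    # 0.50
male_dev    = abs(1 / 4 - 16 / 16)   # 0.75
topline_dev = abs(4 / 10 - 16 / 40)  # 0.00 (re-weighting changes nothing here)
score = female_dev + male_dev + topline_dev  # 1.25, i.e. 125%
```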

A *Good* pollster published these results:

Good | Candidate A | Candidate B | Total | Deviation |
---|---|---|---|---|
All | 5 | 5 | 10 | 10% |
Female | 3 | 3 | 6 | 50% |
Male | 2 | 2 | 4 | 50% |

This *Good* pollster missed the overall outcome, but was actually more accurate than the *Bad* pollster by measure of deviation. Each gender deviated by 50% from its actual outcome, which caused the overall result to deviate by 10%. The total deviation is 110%; this is a simple example of a pollster getting the topline wrong because the sub-samples were wrong.

A *Better* pollster also published results:

Better | Candidate A | Candidate B | Total | Deviation |
---|---|---|---|---|
All | 4 | 6 | 10 | 0% |
Female | 1 | 5 | 6 | 17% |
Male | 3 | 1 | 4 | 25% |

This pollster was mostly correct with the gender samples and accurate with the overall result. Their total deviation is a mere 42%.

The last, and *Best*, pollster’s result:

Best | Candidate A | Candidate B | Total | Deviation |
---|---|---|---|---|
All | 5 | 5 | 10 | 10% |
Female | 1 | 5 | 6 | 17% |
Male | 4 | 0 | 4 | 0% |

The *Best* poll erred on the opinion of just one female respondent, yet missed the overall result. Their total deviation, however, was just 27%. This is the most *accurate* pollster.

Let’s now rank the 4 pollsters by their deviation, derived from their *gender expectation*:

Pollster | Deviation |
---|---|
Best | 27% |
Better | 42% |
Good | 110% |
Bad | 125% |

Not surprisingly, the *Best* pollster had the lowest deviation. Let’s revisit the naïve ranking from above, with each pollster’s true identity revealed:

Pollster | Topline Margin |
---|---|
Bad | 0% |
Better | 0% |
Good | 10% |
Best | 10% |

The naïve topline margin resulted in the *Best* poll being ranked the worst. By using the *gender expectation*, we’re able to analyze a poll’s sample more deeply and truly assess whether it was accurate for the correct reasons.
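The whole comparison can be reproduced in a few lines. This sketch (the names and layout are mine) scores each poll on candidate *A*’s vote share per group, and skips the re-weighting step, since every poll here already matches the actual 40/60 male/female split:

```python
actual = {"all": 16 / 40, "female": 0 / 24, "male": 16 / 16}
polls = {  # pollster -> each group's share for candidate A
    "Bad":    {"all": 4 / 10, "female": 3 / 6, "male": 1 / 4},
    "Good":   {"all": 5 / 10, "female": 3 / 6, "male": 2 / 4},
    "Better": {"all": 4 / 10, "female": 1 / 6, "male": 3 / 4},
    "Best":   {"all": 5 / 10, "female": 1 / 6, "male": 4 / 4},
}
scores = {
    name: sum(abs(shares[g] - actual[g]) for g in actual)
    for name, shares in polls.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.0%}")  # Best 27%, Better 42%, Good 110%, Bad 125%
```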

We'll use this technique going forward to assess the accuracy of each pollster in a given matchup; rankings from the 2016 New Hampshire Democratic Primary will be published later this week.