Can We Trust Opinion Polls?
The election season is here in my home state of Tamil Nadu (India). Apart from the intense campaigning, mudslinging and catchy party propaganda songs, the election season is also the time for statistics. Many leading survey agencies have hit the road, with an aim to understand the ‘pulse of the people’, and to determine which political party/leader has the people’s trust.
In the recent past, the number of opinion polls has risen astronomically, and so has the criticism of them, driven mainly by their frequent failure to predict results accurately. This has become almost a global phenomenon.
In my limited experience, anecdotally speaking, many people think of opinion polls as a sort of scientific experiment in which the correct result can be achieved every single time. To them, opinion polls are the political equivalent of pH paper, which turns red or blue based on people’s opinions. Contrary to this perception, and despite their scientific design, opinion polls fail quite often. This is not always due to malpractice or incompetence; it can be caused entirely by the sheer complexity of the factors affecting people’s political views and choices.
In this article, I generate data for a hypothetical country called ‘People Land’ and examine how accurately opinion polls predict its election results. I hope this example illustrates the complexity underlying the accuracy of opinion polls.
People Land is a parliamentary democracy, where the political party with the largest number of members in parliament forms the government. There are 100 members of parliament representing the 100 constituencies in the country. Each constituency has a voter population of 100,000. There are five political parties in the country.
Each constituency is unique in its own way, in terms of the average age of the voters, their economic status (read income), educational background (read years of formal schooling), and social affiliations. I first randomly select the average age, income, and years of schooling for each constituency. I ensure that there is some positive correlation between the data for income and schooling to reflect that the two factors often work hand in hand.¹
Based on the average age, years of schooling and income of each constituency, I then generate data of the three variables for the 100,000 constituents. For example, if the average age of a constituency is 20, and there are four constituents, the age distribution of the constituents could be 22, 22, 18 and 18.² The technical details of how I generated this data are provided in the footnotes.
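The two-stage generation described above can be sketched roughly as follows. This is only an illustration, not the code behind the article: the function names are hypothetical, and the covariance values (variances of 1.0 and a covariance of 0.5) are placeholders I have assumed, since the actual matrices are not reproduced here. Only the age parameters (mean zero and variance two for constituencies, standard deviation one for constituents) come from the footnotes.

```python
import math
import random


def correlated_pair(rng, mean_x, mean_y, var_x, var_y, cov_xy):
    """Draw one (x, y) pair from a bivariate normal using a 2x2 Cholesky factor."""
    a = math.sqrt(var_x)
    b = cov_xy / a
    c = math.sqrt(var_y - b * b)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return mean_x + a * z1, mean_y + b * z1 + c * z2


def constituency_priors(rng, n_constituencies=100):
    """Stage 1: one 'prior score' per variable for every constituency."""
    priors = []
    for _ in range(n_constituencies):
        age = rng.gauss(0, math.sqrt(2))  # mean zero, variance two (footnote 1)
        # Assumed covariance: positive, so schooling and income move together.
        schooling, income = correlated_pair(rng, 0, 0, 1.0, 1.0, 0.5)
        priors.append({"age": age, "schooling": schooling, "income": income})
    return priors


def constituents(rng, prior, n_voters):
    """Stage 2: individual scores centred on the constituency's prior scores."""
    voters = []
    for _ in range(n_voters):
        age = rng.gauss(prior["age"], 1.0)  # standard deviation one (footnote 2)
        schooling, income = correlated_pair(
            rng, prior["schooling"], prior["income"], 1.0, 1.0, 0.5  # assumed
        )
        voters.append((age, schooling, income))
    return voters


rng = random.Random(0)
priors = constituency_priors(rng)
# 1,000 voters keeps the demo quick; People Land has 100,000 per constituency.
sample_voters = constituents(rng, priors[0], n_voters=1000)
```

Each voter is a tuple of (age, schooling, income) scores rather than real-world magnitudes, in line with the footnotes' treatment of the variables as scores.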
In addition to the three variables, every constituency has a proportion of people with strong social affiliations that sometimes influence their decisions. These affiliations have no correlation with income, education or age (yes, this is not true in reality, but remember this is People Land).³ Everyone else may have mild social affiliations that play a limited role in their decision making.
The Voting Pattern of People Land
Let us begin by exploring how citizens decide who to vote for in People Land. In the simplest of scenarios (call this Scenario 1), let us assume that people vote solely based on their income and age.
The voters of People Land have a ‘voting calculator’ (similar to a Body-Mass Index calculator) that gives them a number between zero and one when they key in their income and age. This is called their ‘voter score’. They then compare this score with a scale to see which party they must vote for. If the value is between 0 and 0.2, they choose Party 1; if it is greater than 0.2 and less than or equal to 0.4, they choose Party 2; and so on.⁴ The full scale works as follows:
Voter score from 0 to 0.2: Party 1
Voter score greater than 0.2 and up to 0.4: Party 2
Voter score greater than 0.4 and up to 0.6: Party 3
Voter score greater than 0.6 and up to 0.8: Party 4
Voter score greater than 0.8 and up to 1: Party 5
For example, a person with an age of 20, and with an income of 100,000 might get a value of 0.5 from the calculator. They would then choose Party 3, as it is between 0.4 and 0.6.
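The lookup from voter score to party can be written in a few lines. The sketch below assumes the five equal bands described above, with each band including its upper boundary; the function name is my own.

```python
from bisect import bisect_left

# Upper boundaries of the bands for Parties 1 to 4; anything above 0.8 is Party 5.
UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8]


def party_for_score(score):
    """Return the party number (1 to 5) for a voter score between 0 and 1."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("voter score must lie between 0 and 1")
    # bisect_left counts the boundaries strictly below the score,
    # so a score of exactly 0.4 still falls in Party 2's band.
    return bisect_left(UPPER_BOUNDS, score) + 1


example_party = party_for_score(0.5)  # the voter from the example above
```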
In another scenario (Scenario 2), we can complicate things by adding more factors to the equation, such as the voters’ social affiliations and years of schooling.⁵ The calculator in Scenario 2 takes these additional factors into account before giving the voter scores.
How do Opinion Polls Perform in the Two Scenarios?
Naturally, we would assume that opinion polls work better in the first scenario, where fewer factors affect people’s choices, than in the second.
Let us put this theory to the test. In each scenario, let us assume that there are 100 survey agencies that conduct opinion polls. Each agency collects a completely random sample of 30 individuals from every constituency to predict the winner of that constituency. The survey agencies then aggregate their respective results from each constituency to predict the overall winner of the election (very similar to how it works in reality). For simplicity, let us assume that everyone reveals their true preference in surveys and that everyone votes on election day.
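One agency's procedure might look like the sketch below. It assumes each constituency is simply a list of the party numbers its voters prefer, and that both constituencies and the country are decided by plurality; the helper names are hypothetical and the toy country is made up purely to show the mechanics.

```python
import random
from collections import Counter


def plurality_winner(votes):
    """Return the party with the most votes (ties go to the party counted first)."""
    return Counter(votes).most_common(1)[0][0]


def run_poll(constituencies, sample_size, rng):
    """Sample every constituency, call each seat, and predict the overall winner."""
    predicted_seats = []
    for voters in constituencies:
        sample = rng.sample(voters, sample_size)  # simple random sample
        predicted_seats.append(plurality_winner(sample))
    return plurality_winner(predicted_seats)


def actual_result(constituencies):
    """The true winner: the party that carries the most constituencies."""
    return plurality_winner([plurality_winner(v) for v in constituencies])


# Toy country with three one-sided constituencies, so the poll cannot miss.
toy_country = [[1] * 50, [1] * 50, [2] * 50]
prediction = run_poll(toy_country, sample_size=30, rng=random.Random(0))
truth = actual_result(toy_country)
```

Repeating `run_poll` 100 times with different random seeds and counting how often the prediction equals `actual_result` would give the kind of tally reported for the 100 agencies.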
We then look at how many of the 100 surveys correctly predicted the overall winner in each of the two scenarios.
Surprise, surprise! The opinion polls perform better in Scenario 2. 80 out of 100 opinion polls in Scenario 2 got the result right. In Scenario 1, despite sharing an identical survey strategy with Scenario 2, only 35 opinion polls out of 100 got the result right. What’s going wrong?
Well, to answer this, let’s look at the ‘voter score’ distribution in both scenarios.
In the case of Scenario 1, the electorate is almost equally split among the five parties (the close contest in the popular vote is also seen across a number of constituencies). This is because the voter score is so simple that it merely tracks the smooth age and income distributions.
In Scenario 2, we find the preference of the electorate to be more concentrated among Parties 2, 3 and 4 (the same concentration of votes among three political parties is seen across a number of constituencies). The various factors affecting voter choice interact with each other and concentrate people’s opinions.
We can relate this to reality: people often vote for a political party for one particular reason while compromising on their other requirements. For example, business people might vote for a party primarily for its economic agenda, and students might vote for a party’s education agenda. This sort of compromise in favour of the policies that matter most to each voter eventually concentrates the votes among a few parties.
Despite the race between the three preferred parties being very close, the opinion polls manage to still perform very well. This is the sheer impact that a concentration of opinions and thereby a reduction in the number of closely contending parties can have on opinion polls.
It is important to note that the accuracy of opinion polls does not depend so much on the number of factors affecting people’s choices as on the distribution of voter preferences. A tight neck-and-neck contest between the parties is a particular problem for opinion polls.
The rationale behind this result is fairly straightforward. In Scenario 1, where the contest is close, opinion polls need a large sample to capture the wide variety of opinions. A small sample often fails to represent those opinions fully, which severely affects accuracy. Note that this is not intentional, but merely a consequence of the small sample size. In such cases, opinion polls are no better predictors than the octopuses, squirrels and bears that predict the winners of elections and soccer games.
In Scenario 2, the concentration of opinions helps, as it reduces the number of opinions that a sample ideally needs to capture. This enables even a small sample to capture the complete variety of opinions accurately, and therefore the winner of the election.
When the contest is tough, small sample sizes simply fail to capture the diversity in opinion and hence fail to predict the correct winner. In the case when there is a concentration of votes between a few parties, opinion polls perform better, as the sample size can actually capture the complete variety in opinions.
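A small Monte Carlo experiment makes the effect of a close contest concrete. The vote shares below are invented for illustration and are not the People Land figures; the sketch only measures how often a random sample shows the true leading party strictly ahead of every other party.

```python
import random
from collections import Counter


def hit_rate(shares, sample_size, trials, seed=0):
    """Fraction of simulated polls in which the true leader is the sole sample leader."""
    rng = random.Random(seed)
    parties = list(range(1, len(shares) + 1))
    true_winner = max(parties, key=lambda p: shares[p - 1])
    hits = 0
    for _ in range(trials):
        sample = rng.choices(parties, weights=shares, k=sample_size)
        counts = Counter(sample)
        top = max(counts.values())
        leaders = [p for p, c in counts.items() if c == top]
        if leaders == [true_winner]:  # a tie for first place counts as a miss
            hits += 1
    return hits / trials


near_tie = [0.21, 0.20, 0.20, 0.20, 0.19]    # hypothetical five-way close contest
clear_lead = [0.40, 0.20, 0.20, 0.10, 0.10]  # hypothetical contest with a clear leader

small_sample_tie = hit_rate(near_tie, sample_size=30, trials=2000)
small_sample_clear = hit_rate(clear_lead, sample_size=30, trials=2000)
large_sample_tie = hit_rate(near_tie, sample_size=3000, trials=200)
```

With only 30 respondents, the sampling error on a 20 percent share is roughly seven percentage points, far larger than the one-point gaps in the close contest, which is why the small poll struggles there and a much larger sample does better.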
Even in a simple case with made-up data, where people choose their preferred party based on their score from a ‘voting calculator’, we find that a sample size that worked perfectly well in one scenario fails in another. Hence, the success of polls is not merely a function of the sample size but also of the underlying homogeneity or heterogeneity in people’s opinions.
Even a scientifically administered survey, like the surveys in People Land, can fail due to the sheer complexity and diversity of people’s opinions. This is one of the reasons why opinion polls fail in countries like India, where there is a rich diversity in people’s voting preferences.
Footnotes (Technical Details)
¹ I generate the average age for constituencies from a normal distribution with a mean of zero and a variance of two. I generate the average years of education and income data for constituencies from a bivariate normal distribution with a mean of (0,0) and the following covariance matrix,
For the purposes of this article, the actual magnitudes of the ‘education’ or ‘age’ are irrelevant. You can treat the realization of these random variables as ‘constituency prior scores’, which would then be inputted into some monotonic deterministic map to arrive at the actual average age/years of education/income of the constituency.
² I generate this from a normal distribution (bivariate in the case of income and education), with the mean set at the constituency's prior scores (see footnote 1). I set the standard deviation at one for age. I use the following covariance matrix for education and income,
³ The proportion of the population in each constituency with strong social affiliations is selected from a combination of a Beta(2,5) distribution and a Uniform(0,1) distribution. The proportion with ‘Extreme Left Social Affiliations’ (LE) and that with ‘Extreme Right Social Affiliations’ (RE) are as follows,
I then randomly select the target number of individuals as a simple random sample (without replacement) and assign them the respective social affiliation scores (+4 for right and -4 for left). Everyone who is not sampled into either social affiliation group is given a value from a normal distribution with a mean of zero and a standard deviation of 0.1.
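The assignment step can be sketched as below. The left and right shares are taken as plain inputs, because the expressions for LE and RE are not reproduced here; the function name and the example shares of 10 and 5 percent are hypothetical.

```python
import random


def social_scores(n_voters, share_left, share_right, rng):
    """Give -4 to the extreme-left group, +4 to the extreme-right group,
    and a small N(0, 0.1) score to everyone else."""
    n_left = round(n_voters * share_left)
    n_right = round(n_voters * share_right)
    # One draw without replacement, so nobody lands in both groups.
    chosen = rng.sample(range(n_voters), n_left + n_right)
    scores = [rng.gauss(0, 0.1) for _ in range(n_voters)]
    for i in chosen[:n_left]:
        scores[i] = -4.0
    for i in chosen[n_left:]:
        scores[i] = 4.0
    return scores


scores = social_scores(1000, share_left=0.10, share_right=0.05, rng=random.Random(1))
```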
⁴ The voter’s score between zero and one is based on a simple sigmoid function of the following form,
I chose the range for the priors and variable distributions to ensure that we would get values between 0 and 1 for V(Income, Age). If 0.5*Income + 0.5*Age were greater than 4 or less than -4 in all cases (i.e. for the entire population of People Land), then everyone would vote for Party 5 or everyone would vote for Party 1. This would defeat the very purpose of this simulation. Statistically speaking, the use of scores instead of actual values is not a problem, as we can always define a monotonic invertible affine map between [-4,4] and [Variable Plausible Min, Variable Plausible Max] such that the actual age or income can be translated to our scores.
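As a rough illustration of both ideas in this footnote, the sketch below assumes the sigmoid is the standard logistic function applied to 0.5*Income + 0.5*Age. That form is my assumption and the exact expression may differ. The affine map and the plausible age range of 18 to 90 are likewise only examples.

```python
import math


def to_score(value, plausible_min, plausible_max):
    """Affine map from [plausible_min, plausible_max] onto the score range [-4, 4]."""
    return -4.0 + 8.0 * (value - plausible_min) / (plausible_max - plausible_min)


def voter_score(income_score, age_score):
    """Assumed logistic form: 1 / (1 + exp(-(0.5 * income + 0.5 * age)))."""
    return 1.0 / (1.0 + math.exp(-(0.5 * income_score + 0.5 * age_score)))


youngest = to_score(18, 18, 90)  # hypothetical plausible age range of 18 to 90
oldest = to_score(90, 18, 90)
middle = voter_score(0.0, 0.0)
saturated = voter_score(4.0, 4.0)
```

Under this assumed form, a combined input of 4 already gives a score of about 0.98, deep inside Party 5's band, which is why inputs beyond that range for the whole population would leave only one party with any votes.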
⁵ The voter’s score between zero and one is based on a simple sigmoid function of the following form,
Author: Akshay Natteri Mangadu