In my last blog post I started the analysis of fictitious survey data concerning what percent of people plan to vote for candidate A in the upcoming presidential election. By fitting the Beta Binomial distribution, I was able to discover that the percent (*p*) of people varies between cities, and can be modeled with the Beta(27.64, 28.63) distribution. This distribution has an average of 0.491. But this doesn’t necessarily mean candidate A will get 49.1% of the total vote when combined across those cities I sampled. Cities with a higher population have more weight when the popular vote is totaled.

If all the cities have close to equal populations, then the popular vote total will be around 49.1%. But some cities are larger than others in reality. If the smaller values of *p* come from the larger cities, then the result will be less than 49.1%. If the larger values of *p* come from the larger cities, then the result will be greater than 49.1%. If the range of *p* is scattered across small and large cities, then the result will likely be close to 49.1%.
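To make the weighting argument concrete, here is a small Python sketch. The four proportions and populations are made-up numbers chosen only for illustration (they are not from the survey), and `weighted_share` is a hypothetical helper name.

```python
def weighted_share(p, population):
    """Population-weighted average of per-city proportions."""
    total = sum(population)
    return sum(pi * ni for pi, ni in zip(p, population)) / total

# Hypothetical per-city support for candidate A (unweighted mean = 0.49).
p = [0.40, 0.45, 0.52, 0.59]
small_to_large = [100_000, 200_000, 400_000, 800_000]

equal = weighted_share(p, [250_000] * 4)                  # equal populations
high_p_in_large = weighted_share(p, small_to_large)       # larger p in larger cities
low_p_in_large = weighted_share(p, small_to_large[::-1])  # smaller p in larger cities

print(f"equal populations:      {equal:.3f}")
print(f"large cities favor A:   {high_p_in_large:.3f}")
print(f"large cities oppose A:  {low_p_in_large:.3f}")
```

With equal populations the weighted result matches the simple average; pairing the larger proportions with the larger cities pushes it up, and the reverse pairing pulls it down.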

In my fictitious example, the populations of the cities I sampled are included in the file Voting.com. I retrieved the data from Wikipedia, but the original source is the US Census Bureau. By the way, it’s easy to retrieve data from a Web page with JMP by using the **Internet Open** command on the File menu. Simply provide the URL, and JMP imports any tabular data it finds at that URL.

How do we estimate the overall proportion of votes for candidate A for the 100 cities I sampled? Use the survey results as estimates of the percent in a city, multiply by the population of the city, and then sum across cities to get overall totals. But not so fast -- we need to adjust the population numbers down to account for the fact that not all the population is of voting age. Also, not everyone of voting age actually votes, so a further adjustment is needed.
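Written as a formula, the estimate described above is a weighted average. The symbols are my own notation, not taken from the script: $\hat{p}_i$ is the survey proportion for city $i$, $N_i$ is its population, $a_i$ is the proportion of the population that is of voting age, and $t_i$ is the proportion of the voting-age population that actually votes.

```latex
\hat{P} \;=\; \frac{\sum_{i=1}^{100} \hat{p}_i \, N_i \, a_i \, t_i}
                   {\sum_{i=1}^{100} N_i \, a_i \, t_i}
```

The numerator is the estimated number of votes for candidate A, and the denominator is the estimated number of votes cast.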

Again turning to the Census Bureau, I found data on the proportion of a population that is of voting age. The numbers vary from location to location. Instead of using one number (like the average) to summarize all the proportions, I fit a distribution. Since proportions are between 0 and 1, the Beta distribution is a great candidate. Using JMP’s distribution fitting capabilities, the fitted distribution turns out to be the Beta(397,126). Below is a graph of the distribution. The bulk is between 0.7 and 0.8, which means that for most locations the percent of the population that is of voting age is 70% - 80%.

I also found data on what percentage of a voting age population actually vote. Again fitting a Beta distribution, I get the Beta(30,32). See the graph below. The bulk is between 0.3 and 0.7, which means that for most locations the percent of a voting age population that actually vote is 30% - 70%.
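As a quick sanity check on the two fitted distributions, the mean and standard deviation of a Beta(a, b) follow from the standard closed-form formulas. This is a Python sketch rather than JMP output, and `beta_mean_sd` is a hypothetical helper name.

```python
from math import sqrt

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

fits = {"voting age": (397, 126), "turnout": (30, 32)}
for name, (a, b) in fits.items():
    mean, sd = beta_mean_sd(a, b)
    low, high = mean - 2 * sd, mean + 2 * sd
    print(f"{name}: mean {mean:.3f}, sd {sd:.3f}, mean +/- 2 sd = ({low:.3f}, {high:.3f})")
```

In both cases the mean plus or minus two standard deviations falls inside the ranges read off the graphs (0.7 to 0.8 for voting age, 0.3 to 0.7 for turnout).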

I’m a fan of Monte Carlo simulation. So let's use the JMP scripting language to run a simulation. The script is attached with the data if you want to try it yourself. The script is shown below:

Running the simulation 10,000 times (which takes less than 10 seconds) gives the following distribution for the proportion of the popular vote going to candidate A in the 100 cities.

The average of the distribution is 50.37%. For an interval estimate, use the 2.5th and 97.5th percentiles. I estimate candidate A will get between 50.12% and 50.61% of the popular vote for those 100 cities I sampled.
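For readers without JMP, the simulation can be sketched in Python. This is not the JSL script attached to the post, only a rough equivalent under stated assumptions: the per-city survey proportions and populations below are randomly generated stand-ins (so the printed numbers will not match the results above), and I assume a fresh voting-age proportion and turnout proportion is drawn for every city in every run. `simulate_share` is a hypothetical function name.

```python
import random

def simulate_share(p_hat, population, n_sims=10_000, seed=1):
    """Simulate candidate A's share of the combined vote across cities."""
    rng = random.Random(seed)
    shares = []
    for _ in range(n_sims):
        votes_a = 0.0
        votes_total = 0.0
        for p, n in zip(p_hat, population):
            voting_age = rng.betavariate(397, 126)  # proportion of voting age
            turnout = rng.betavariate(30, 32)       # proportion of those who vote
            voters = n * voting_age * turnout
            votes_a += p * voters
            votes_total += voters
        shares.append(votes_a / votes_total)
    return shares

# Hypothetical stand-in data: NOT the survey results or the census populations.
data_rng = random.Random(0)
p_hat = [data_rng.betavariate(27.64, 28.63) for _ in range(100)]
population = [data_rng.randint(100_000, 2_000_000) for _ in range(100)]

shares = sorted(simulate_share(p_hat, population, n_sims=2_000))
mean_share = sum(shares) / len(shares)
lower = shares[int(0.025 * len(shares))]  # approximate 2.5th percentile
upper = shares[int(0.975 * len(shares))]  # approximate 97.5th percentile
print(f"mean {mean_share:.4f}, 95% interval ({lower:.4f}, {upper:.4f})")
```

Swapping in the real survey proportions and city populations for `p_hat` and `population` is what would tie the sketch back to the analysis in this post.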

50% is in the lower tail of the distribution. What is the probability candidate A will get at least 50% of the vote in the 100 cities? We can fit a distribution and estimate the probability. If you are thinking the distribution looks Normal, then you are right. Recall the central limit theorem, which says the distribution of an average is asymptotically Normal as the number of items in the average increases. The overall proportion is a weighted average of individual city proportions; thus it is approximately Normal. Using a Normal distribution with mean = 0.5037 and stdev = 0.00124, the probability that candidate A receives at least 50% of the vote in the 100 cities is 99.85%. That is good news if you like candidate A.
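The tail probability can be reproduced outside JMP with Python's standard library. This sketch plugs in the rounded mean and standard deviation quoted above, so the last digit may differ slightly from the figure JMP reports.

```python
from statistics import NormalDist

# Normal approximation to the simulated share, using the rounded values above.
share = NormalDist(mu=0.5037, sigma=0.00124)

# P(share >= 0.50) is one minus the cumulative probability at 0.50.
prob_at_least_half = 1 - share.cdf(0.50)
print(f"P(share >= 50%) = {prob_at_least_half:.4f}")
```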

Remember that the distribution of *p* has a mean of 49.1%. This value is more than 10 standard deviations below the mean in the simulation results. In fact, it’s not even shown on the histogram. This means there is virtually no chance the overall percent for candidate A will be as low as 49.1%. This would have been the average assuming the populations are the same. Thus, accounting for the city populations certainly made a difference when totaling across cities. Maybe not much though, since 50.37% - 49.1% is only about 1.3 percentage points.
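The "more than 10 standard deviations" figure follows directly from the simulation summary (mean 0.5037, stdev 0.00124):

```latex
z \;=\; \frac{0.5037 - 0.491}{0.00124} \;=\; \frac{0.0127}{0.00124} \;\approx\; 10.2
```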

This example is simple and doesn’t account for a number of factors. For example, how do you handle within-city variation? A different sample of people could yield different results, and people could very well change their minds between now and election day. A good idea might be to model the within-city variation with a Beta distribution as well. And what about cities with fewer than 100,000 people? Is it safe to extrapolate these results to smaller cities? And what about the electoral college? For an interesting answer to the electoral college question, see this article about what a BYU professor is doing to predict the election outcome.
