United States Presidential Election Prediction Model & Swing State Behavior Study (2021-EU-30MP-760)

Level: Intermediate

 

Saloni Patel, Student, Stanford OHS
Mason Chen, Student, Stanford OHS

 

This project validates a prediction model against the actual result of the 2020 United States presidential election. The model produces a predicted election result derived from the z-scores of the number of infected cases, deaths, and unemployment increase rates for each of 15 “swing states,” along with the 2012-2016 election result average. To identify the most important swing states, a Swing State Index was derived using the 2012, 2016, and 2020 election outcomes. The predicted election result is then reduced by set percentages in response to media reports that Donald Trump was expected to lose 3-5 percent of his votes from the 2016 election. The predicted 2020 election result and its reduced values are compared against the 2020 actual election result to assess their accuracy. Paired t-tests and regression tests are used to compare the 2020 actual result with the 2020 predicted result, the 2016 actual result, and the 2012 actual result, to see how the 2020 predicted result compares with the 2016 and 2012 election results in predicting the 2020 actual result. A one-proportion hypothesis test is also used to assess the accuracy of the 2020 predicted result against the 2020 actual result.

The next part of this project studies factors that influenced the voting behavior of the 15 key swing states in the 2020 United States presidential election by linking statistical clustering methods with notable political events. In addition to key decisions made by the Trump administration, factors unique to this presidential election, such as the global COVID-19 pandemic and the Black Lives Matter movement, were investigated. Hierarchical clustering was used to group the 15 swing states based on the Swing State Index, and each cluster was linked to events that may have factored into its behavior. The most representative and significant swing states were identified as Arizona, Georgia, Wisconsin, and Pennsylvania (based on the clustering history) as well as Michigan and Minnesota (based on the Swing State Index). After analyzing specific events that affected these six states’ voting behavior, the Black Lives Matter movement and concerns over health care were found to be the most significant factors in President Trump’s defeat. Next, the state of Georgia was studied further to better understand the influence of COVID-19 and the economy on the state’s voting behavior. By adjusting the ratio of the COVID-19 values (infected cases and deaths) to the economic value (unemployment rate), it was found that the economy was of greater importance than COVID-19 to Georgian voters. This approach of connecting political science (e.g. government decision-making) with clustering methods can be applied to future elections to better predict the outcome of important swing states and, thus, the overall election results.

All calculations and analysis are done on the JMP 15 platform.

 

 

Auto-generated transcript...

 

Speaker

Transcript

Saloni Patel Okay, so hello, my name is Saloni and today I'll be presenting our project, the United States presidential election prediction model and swing state behavior study.
There are two parts in this project, the first involves creating and evaluating a model meant to predict the 2020 US presidential election.
The second part of the project will study swing state behavior in the 2020 US presidential election and identify key events that affected the voting patterns in the election using hierarchical clustering methods. All the analysis was done on the JMP 15 platform.
To clarify, our project does not focus on all 50 US states; instead, we will only study the top 15 swing states.
Swing states are states that can reasonably be won by either the Democratic or Republican presidential candidate, as opposed to safe states that consistently lean towards one party.
Additionally, the US voting system depends on the Electoral College system that gives a set number of votes to each state based on population numbers.
There is a total of 538 electoral votes so a presidential candidate must get 270 electoral votes to win the presidential election.
Since most of the states are known to vote for either a Democratic or Republican candidate, with little chance of being swayed out of their normal voting pattern,
the Electoral College system and the presidential election result depend on the bulk of the swing states that can potentially be won by any of the candidates.
A win by even a small margin results in that candidate acquiring all the votes the state has to offer, so swing states are especially impactful in determining the next president.
So, to begin, we conducted this project in hopes of better understanding the historic 2020 US election that occurred in the middle of a global pandemic and in socially as well as economically unstable times.
The objectives of the first part of our project are to identify key swing states, create a prediction model based on the influence of COVID-19 and the economy in those identified swing states, and lastly validate the prediction model with the actual election results once those came out.
So the first step in our prediction model is identifying the top 15 swing states from the past three elections using this swing state index.
We use this formula to determine whether the states are swing states or not. It is also important to note that this swing index does not take into account which side each state votes for,
but rather the election results themselves. In other words, a state could have voted for the same side in all three elections, yet by very different margins, and still be counted as a swing state.
We will further study this index in the next part of the project, but right now, all we use this index for is to identify the top 15 swing states.
Once the swing states are identified, we derive the first value we will need to calculate the predicted 2020 election result. This is the 2016-2012 composite win margin.
To calculate this value, we took the 2016 election result and the 2012 election result. In the formula,
we gave the 2016 results twice the weight because that election was more recent than the 2012 election, and we gave the 2016 election another twice the weight
because President Trump was present in the 2016 election running for president, while Joe Biden was present in the 2012 election as a candidate for vice president. In total, the 2016 results have four times the weight of the 2012 results in the 2016-2012 composite win margin.
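The weighting just described can be sketched as a weighted average. This is only an illustration, not the formula from the slides: the talk gives the 4-to-1 weighting, but dividing by the total weight of 5 is an assumption, and the function name and sample margins are hypothetical. The sign convention (negative for a Democratic lead, positive for a Republican lead) follows the one used later in the talk.

```python
def composite_win_margin(margin_2016, margin_2012):
    """Weighted average of two win margins, in percentage points.

    Assumption: the composite is normalized by the total weight (4 + 1).
    """
    weight_2016 = 2 * 2  # twice for recency, twice again for Trump being on the 2016 ballot
    weight_2012 = 1
    total_weight = weight_2016 + weight_2012
    return (weight_2016 * margin_2016 + weight_2012 * margin_2012) / total_weight


# Hypothetical margins: positive = Republican lead, negative = Democratic lead
print(composite_win_margin(5.0, 0.0))
```

With this weighting, a state's composite margin sits much closer to its 2016 result than to its 2012 result.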
Next, we identified factors that are unique to 2020 and that voters may vote according to.
We found that the global COVID-19 pandemic and the hit the economy took as a result were important factors
unique to 2020, so we collected the infected cases and deaths due to COVID-19, as well as the unemployment increase, in each state.
Next, we applied the Z-standardization transformation to avoid any bias from differences in sample mean and variance.
Using those Z-scores (Z-infected, Z-deaths, and Z-unemployment), we derived the Z-COVID index. This index represents the impact that the global pandemic and the following economic hit had on each state.
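As a rough sketch of this step, the code below standardizes each measure across states and then combines the three Z-scores. The talk does not give the exact combination, so summing the three Z-scores (which gives the two COVID-19 values twice the combined weight of the single economic value) is an assumption, as is the use of the sample standard deviation. All names and numbers here are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize a list of values to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def z_covid_index(infected, deaths, unemployment_increase):
    """Assumed form of the index: the sum of the three Z-scores for each state."""
    zi = z_scores(infected)
    zd = z_scores(deaths)
    zu = z_scores(unemployment_increase)
    return [a + b + c for a, b, c in zip(zi, zd, zu)]


# Made-up values for three hypothetical states
index = z_covid_index([100, 200, 300], [5, 9, 4], [2.0, 3.5, 1.0])
print(index)
```

Because each measure is standardized first, a state with large raw case counts does not dominate the index simply because cases are measured on a larger scale than unemployment rates.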
Lastly, we calculated the 2020 predicted result, using the 2016-2012 composite win margin and the Z-COVID index.
Once the 2020 election passed, we recorded the
2020 actual election results and proceeded to validate our prediction model and check whether our choice of factors did a good job of helping predict the 2020 election result.
Additionally, since the media before the election had predicted that Trump would lose 3 to 5% of the votes from 2016, we decided to subtract certain percentages from the predicted results.
In the table below, that predicted result is the zero percent category and the reductions can be seen as well.
To analyze the results, we compared the predicted results with the 2020 actual election results using regression and the paired t-test.
To see how the 2020 predicted results compared with previous election results at predicting the 2020
actual result, we also included the 2012 and 2016 election results in our evaluation. Lastly, we also conducted a 1-proportion hypothesis test to test the accuracy of the 2020 predicted results.
To begin, we conducted a regression test using just the election results from each state.
The 2012 result compared to the 2020 actual election result did not yield a significant result.
However, the regression test between the 2016 actual and 2020 actual displayed a significant result and the highest R squared, 0.81.
The slope of the regression is also close to 1, at about 1.17, suggesting a strong regression relationship. The regression between the 2020 predicted and the 2020
actual also displayed a significant result but a lower R squared of 0.3.
From these results, the regression between the 2016 actual and 2020 actual results had the highest R squared and the slope closest to one, despite the two elections having declared different winners.
It is reasonable to find that the 2020 election results would be correlated with the 2016 election results, since Trump lost swing states that he had narrowly won in the 2016 election, such as Michigan, Pennsylvania, and Wisconsin.
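For readers who want to see what the slope and R squared measure, here is a minimal least-squares sketch in plain Python. The analysis in the talk was done in JMP 15; this function and its sample data are hypothetical and are not the state-level election results.

```python
def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation coefficient
    return slope, intercept, r_squared


# Made-up margins for four hypothetical states (earlier election vs. later election)
slope, intercept, r_squared = linear_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept, r_squared)
```

A slope near 1 means a one-point change in the earlier margin goes with about a one-point change in the later margin, while R squared says how much of the variation across states the line explains.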
Next is the paired t-test, which also compares the election result percentages of each state. We use the paired t-test because the same states, or pairs, are being assessed against each other.
The paired t-test only found a significant difference in the means of the 2016
actual election results and the 2020 actual results, which makes sense since these results had a high regression test significance.
This would suggest that the means of the 2012 election and the 2020 predicted results are similar to the mean of the 2020 actual, meaning that the election results are similar.
In the 2012 election, the Democrats had won the election, and the predicted results had predicted that Democrats would win in 2020, while in 2016 the Republicans had won the election. This can explain why 2016 is significantly different from the 2020 actual election results,
while the 2012 actual and 2020 predicted are not.
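The paired t-test works on the per-state differences rather than on the two sets of results separately. The sketch below computes only the t statistic and degrees of freedom; the function name and the sample values are hypothetical, and the talk's own numbers came from JMP.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Made-up margins for three hypothetical states in two elections
t, df = paired_t_statistic([3.0, 5.0, 7.0], [1.0, 2.0, 6.0])
print(t, df)
```

To turn the statistic into a p-value it is compared against the t distribution with the given degrees of freedom; a library routine such as `scipy.stats.ttest_rel` returns both at once.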
Lastly, we use the 1-proportion hypothesis test to test how the 2020 predicted results matched with the 2020 actual results.
Unlike the regression and paired t-test, the 1-proportion test compares the states and which side they voted for.
The regression and paired t-test only compared the election results, without any indication of which side the states voted for.
Therefore, this test is more powerful in validating the prediction model, since it compares the predicted side each state would vote for
and which side the states actually voted for. We assign the states that voted for the predicted side a pass and those that did not vote for the predicted side a fail.
In total, 12 out of 15 states received a pass, as they were predicted accurately, while the other three received a fail. We set the success value to pass, and the result is a sample proportion of 0.8.
Since we want the proportion to be greater than 0.9, or 90% accurate, we set the hypothesized proportion to 0.9.
Since the 0.8 sample proportion failed to exceed 0.9, we failed to reject the null hypothesis at the 95% confidence level, so the prediction model cannot be said to be 90% accurate. According to the proportion of our sample, this model is 80% accurate.
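The 12-out-of-15 result can be checked with an exact binomial calculation. The talk does not say which variant of the one-proportion test JMP applied, so using the exact one-sided binomial test here is an assumption, and the function name is hypothetical.

```python
from math import comb


def binomial_p_value_greater(successes, n, p0):
    """One-sided exact p-value: P(X >= successes) when X ~ Binomial(n, p0).

    Tests H0: p = p0 against Ha: p > p0.
    """
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


# 12 passes out of 15 states, hypothesized proportion 0.9
p_value = binomial_p_value_greater(12, 15, 0.9)
print(round(p_value, 4))  # about 0.9444
```

A p-value this far above 0.05 gives no evidence that the true accuracy exceeds 90%, which matches the conclusion stated above.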
To summarize, the regression test showed significance between the 2016 actual results and the 2020 actual election results,
as well as a weaker significance between the 2020 predicted and 2020 actual election results. We theorize that this may have been because, in this election, President Trump lost swing states that he had narrowly won in the 2016 election.
The paired t-test showed a significant difference between the 2016 actual and 2020 actual, and we theorized that this may have been because those two elections declared different winners.
President Trump won the 2016 election yet lost the 2020 election. Additionally, the 2012 and 2020 predicted results are not significantly different from the actual 2020
election result, which may have been because they both declared the same political party as the winner.
Lastly, the 1-proportion hypothesis test failed to reject the null hypothesis, and so our prediction model is not 90% accurate at the 95% confidence level.
Arizona, Wisconsin, and Minnesota,
which could suggest that there were other major factors besides the impacts of COVID-19 and unemployment rates that influenced the 2020 election result.
This is where we transition to the next part of our project in which we will group states based on their swing state index and identify them with key events that took place in 2020 that could have influenced the swing states' voting behaviors.
So the questions this part of the project will address, following from the last, are: which events and factors influenced the swing states to vote the way that they did?
How much more or less did voters care about COVID-19 than the economy? And, as a side investigation, can we use statistical tools to link political events with voting patterns?
The goal for this part of the project is to study the previously identified swing states' voting patterns by linking statistical clustering methods to political events.
We will also adjust the Z-COVID index, or as we will now call it, the Z-ratio, with new ratios to better understand the importance of COVID-19 and the economy in voting behavior.
Previously the Z-COVID index had a two by one ratio, where the values of COVID-19 infected cases and deaths
were given twice the weight compared to the unemployment increase value, since there were two values for COVID-19 and only one for the economy.
We realized that each state was impacted differently by the pandemic, so we thought it would be appropriate to analyze the effects of switching this two by one ratio to other ratios.
First, we will go back to the swing state index, which helps identify the swing of each state using the election result percentages from the past three elections.
A negative election result indicates that the state voted for Democrats, while a positive result indicates a Republican vote.
The larger the magnitude and the more negative the swing index, the more that state's voting patterns have swung.
If the state changes direction then the signs of the two differences will not be equal, causing the swing state index to be negative and display more of a swing behavior.
From this table, we can see that Michigan and Minnesota have negative values of the largest magnitude, which means they have been swinging the most in the past three elections. Overall, the swing state index is quite useful in understanding basic voting patterns for the swing states.
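The formula for the swing state index is shown on the slides but not in this transcript. The sketch below assumes one form that is consistent with the sign behavior described above: the product of the two election-to-election changes in margin, which turns negative when a state changes direction. Treat it as a guess at the shape of the index, not the authors' formula; the function name and margins are hypothetical.

```python
def swing_index(margin_2012, margin_2016, margin_2020):
    """Assumed index: product of the two consecutive changes in margin.

    Margins are in percentage points (negative = Democratic, positive = Republican).
    A negative product means the state moved one way and then back the other way.
    """
    first_change = margin_2016 - margin_2012
    second_change = margin_2020 - margin_2016
    return first_change * second_change


# Made-up margins: this hypothetical state moves toward Republicans, then back
print(swing_index(-5.0, 2.0, -1.0))
```

Note that such an index depends only on how the margins move, not on which party wins, which agrees with the earlier remark that a state can vote for the same side in all three elections and still count as a swing state.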
However, the swing state index cannot identify key events that caused the voting patterns.
We used hierarchical clustering to study states with similar voting patterns and list potential factors that affected their voting behavior.
Hierarchical clustering grouped the 15 swing states into four different clusters, as seen on the right.
We used this method because of its bottom-up approach, where every state starts as its own cluster before clusters are merged one at a time, moving up the hierarchy. On the right, Iowa and Ohio can be seen in red, indicating that they're in the same cluster.
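The bottom-up idea can be sketched in a few lines of Python. The talk does not name the linkage rule used in JMP, so single linkage (distance between the two closest members) is an assumption chosen for brevity, and the state labels and values below are hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters):
    """Bottom-up clustering with single linkage (an assumed linkage rule).

    points maps a label to a tuple of coordinates.
    Returns (clusters, join_history); each history entry is
    (cluster_a, cluster_b, distance) in the order the merges happened.
    """
    clusters = [[name] for name in points]  # every item starts as its own cluster
    history = []
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(dist(points[a], points[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, history


# Made-up one-dimensional index values for four hypothetical states
points = {"A": (0.0,), "B": (0.1,), "C": (5.0,), "D": (5.3,)}
clusters, history = agglomerate(points, 2)
print(clusters)
print(history[0])
```

The join history records which pairs merged first, which is the same kind of information used later in the talk to pick out the most similar pairs of states.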
As mentioned previously, the hierarchical clustering divided the swing states into four clusters. The first cluster consists of Iowa and Ohio.
Both of these states had voted blue, or Democratic, in the 2012 election, yet red, or Republican, in the 2016 and 2020 elections.
The second cluster has Georgia, Arizona, North Carolina, and Florida. All these states except North Carolina became bluer or redder, or in other words are starting to favor one side heavily.
The third cluster consists of Wisconsin, Pennsylvania, Michigan, Nevada, New Hampshire and Minnesota.
All of these states, besides Nevada, have a negative swing index, meaning they're the most inconsistent swing states.
The last cluster has Colorado, Virginia and New Mexico, which are all relatively blue states, or states that have consistently voted Democrat, and in the 2020 election voted blue by a larger margin than previously.
Now that we have all the clusters and an idea of their characteristics, we looked at the clustering join history, which identifies the top pairs of states, or which two states are the most similar in their clusters.
From the join history, the first two pairs are Wisconsin with Pennsylvania from the third cluster and Georgia with Arizona from the second cluster.
Both pairs are part of the clusters that had states that switched from red to blue in the 2020 election.
After further research, we found that Wisconsin and Pennsylvania appeared to have concerns for the economy,
dissatisfaction with President Trump's healthcare-related policies, such as his efforts to weaken the Affordable Care Act formed under the Obama administration,
as well as concerns for the environment, all of which ultimately led a majority of the voters to vote Democratic.
However, in Georgia and Arizona, we see that major shifts in demographics, such as more registered Latino voters in Arizona,
and the Black Lives Matter movement that exposed serious racial injustice, ultimately caused a majority of the voters to cast a Democratic ballot.
Through hierarchical clustering, we were able to separate states into different groups based on their voting behaviors and draw connections to the key events that caused the observed voting behavior.
Although we found key events that influenced each state's voting behavior, hierarchical clustering did not tell us the weight each event carried in the individual swing state's voting behavior.
COVID-19 and the economic recession that followed.
COVID-19 and the economy.
However, we had to assume that each state would have the same Z-ratio, which gave COVID-19 twice the weight, resulting in a two by one ratio for each state.
However, from the hierarchical clustering, we found that each state has a unique situation and that its voters cared about different issues. To adjust the Z-ratio, we created a value called the Ratio Variable. The Ratio Variable determines the ratio of the importance of COVID-19,
or the Z-COVID index, which represents the infected cases and deaths in each state, versus the economy, or the unemployment value, which represents the annual unemployment increase rate in each state.
Once the Z-ratio is adjusted with a few different ratio variables, such as 0.1, which creates a ratio of one by 10, giving the economy 10 times the importance,
it is implemented into the full formula used to calculate the 2020 predicted results.
These adjusted 2020 predicted results are compared against the 2020 actual election results to determine which ratio best explains the state's situation and how much importance COVID-19 and the economy had in influencing the voting behavior.
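The ratio search described here can be sketched as follows. The exact way the Ratio Variable enters the Z-ratio is not given in the transcript, so multiplying the COVID-19 term by the ratio variable and leaving the economy term at weight 1 is an assumption. In the sample data, only the 0.75 entry (-0.2) and the actual result (-0.3) are figures quoted in the talk for Georgia; the other two predictions are invented placeholders, and all function names are hypothetical.

```python
from fractions import Fraction


def ratio_as_parts(ratio_variable):
    """Express a ratio variable as (covid_weight, economy_weight), e.g. 0.75 -> (3, 4)."""
    frac = Fraction(str(ratio_variable))
    return frac.numerator, frac.denominator


def adjusted_z_ratio(z_covid, z_unemployment, ratio_variable):
    """Assumed form: the COVID-19 term is scaled by the ratio variable, the economy term by 1."""
    return ratio_variable * z_covid + z_unemployment


def closest_ratio(predicted_by_ratio, actual):
    """Pick the ratio variable whose predicted result is nearest the actual result."""
    return min(predicted_by_ratio, key=lambda r: abs(predicted_by_ratio[r] - actual))


# Only the 0.75 entry and the actual value come from the talk; 0.1 and 2.0 are placeholders
predicted = {0.1: 1.5, 0.75: -0.2, 2.0: -2.0}
best = closest_ratio(predicted, actual=-0.3)
print(best, ratio_as_parts(best))
```

A ratio variable below 1 gives the economy more weight than COVID-19, and a value above 1 does the reverse.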
We decided to study Georgia's voting behavior closely since it appeared to stand out compared to the other swing states.
For one, Georgia was the first state to reopen businesses, in April, while the rest of the states had not yet done so.
Additionally, Georgia was a key state in the 2020 election, which President Trump had an eye on even after the election results were announced in attempts to overturn them.
Georgia voted blue by a small margin and had an election result of -0.3%.
The adjusted 2020 predicted results for Georgia were plotted, and a marker for Georgia's actual election result was placed on the graph on the right.
From the graph, we see that the adjusted 2020 predicted result with the ratio variable of 0.75 had a value of -0.2%, which is the closest to Georgia's actual election result, -0.3%.
The 0.75 ratio variable means that the ratio is three by four, indicating that the economy was a more important issue to a majority of the voters in Georgia.
This makes sense because, as mentioned previously, Georgia was the first state to reopen business in April, indicating a strong concern for the well being of its businesses and economy.
In this project we explored different key events and their importance in influencing the voting behaviors of 15 identified swing states using statistical methods.
First, hierarchical clustering was utilized to group swing states based on their voting behavior in the past three elections.
From this we found that states in the second cluster, consisting of Arizona, Georgia and others,
were mostly affected by issues regarding civil rights, while states such as Pennsylvania and Wisconsin in the third cluster had voted for Joe Biden due to concerns for the economy, healthcare, and environment.
Overall, the worsening COVID-19 situation, racial movements such as Black Lives Matter movement,
COVID-19
and the economy, had on each state. Georgia was explored in more detail, and it was found that a three by four ratio matched best with the actual election result,
suggesting that the economy was a more important issue to voters than COVID-19, and this makes sense because Georgia was the first state to reopen business in April.
Thank you for listening to my presentation.