Okay, welcome everybody to this presentation, in which I will talk about Pearl's influenza studies. Raymond Pearl was a biostatistician who lived quite a while ago, during the Spanish Flu pandemic of 1918-1920, and he was the first to analyze the weekly data that the US Bureau of the Census collected about the Spanish Flu for the large American cities.
Pearl wrote two reports about this analysis, his Influenza Studies I and II, which we will revisit during this presentation, and we will see how we can look into and analyze the data using JMP. This is joint work with Chris Gotwalt, who contributed to the methodological component, and Guido Erreygers, who initiated the idea of revisiting Pearl's influenza studies.
The overview of this talk is as follows. First, I'll discuss the Spanish Flu a little bit, what it was all about; it was quite a deadly pandemic at the time. Then I'll introduce Pearl's influenza studies and talk a little about the data Pearl used in his analysis. For our own analysis, we added some census data from 1910 on top of that. Our analysis consists of a variable selection procedure with a null factor, that is, a random, independent normal factor, combined with bootstrap simulation of the data. Then we'll discuss our results, compare them with Pearl's results, and conclude.
First of all, the Spanish Flu pandemic, to frame it a little: it was one of the deadliest pandemics in history, as witnessed by this list showing the world's deadliest pandemics over time. You can see that the Spanish Flu ranks fifth in terms of the number of deaths it caused. The list is headed by the Black Death, which killed about four times more people than the Spanish Flu did. Further down the list, you see the COVID-19 pandemic, which we were all exposed to; it also appears in this list.
Just as with COVID-19, gatherings were encouraged to happen outside. That was already the case at the time of the Spanish Flu pandemic, when even classes were held outdoors, as you can see in this photo.
The Spanish Flu pandemic consisted of three waves. The first one started in March 1918 in the US and spread to Europe and the rest of the world. The second and most severe wave started in August 1918 in France, spread rapidly to the rest of the world, and coincided with the end of the First World War. A third wave, less severe than the second but more severe than the initial one, started at the very beginning of 1919 and hit specific countries such as Australia and Great Britain.
Here you see the timeline of the three waves of the Spanish Flu pandemic in the US specifically, because Pearl's data, which we look into, come from the US. The death toll was enormous, and most deaths actually occurred in India and China.
What was specific about the Spanish Flu pandemic was that many young, healthy adults died, as shown by the W-shaped pattern of mortality rates, indicated here by the solid black line, in contrast to the U shape of the normal seasonal influenza and pneumonia mortality rates registered prior to the pandemic.
For seasonal influenza and pneumonia, the U shape shows that individuals younger than 4 years of age and people older than 75 years were hit hardest. Characteristic of the Spanish Flu pandemic was that, besides those two age groups, young adults between 20 and 40 years of age were also hit particularly hard, eventually leading to a huge death toll.
Later on, epidemiologists and historians of epidemiology applied statistical analyses to the Spanish Flu pandemic, and several of them worked on US data. Markel and colleagues studied the effect of non-pharmaceutical interventions in American cities, such as school closures and cancellations of public gatherings.
Mamelund studied the influence of geographical isolation of specific areas and regions in the US. Clay, Lewis, and Severnini studied cross-city variation across 438 US cities. Similar work was done elsewhere around the world, not only in the US but also in Europe, for example in Norway and Spain, where data about the Spanish Flu pandemic were analyzed together with census data and so forth. The Spanish Flu pandemic has kept intriguing researchers over time.
Now, about Pearl's influenza studies: here you see the beginning of the first of them. He first discusses the severity of the outbreak and notes that the death toll for the United States was set at about 550,000, roughly five times the number of Americans who died during the First World War. This shows how severe, how deadly, how explosive this pandemic was, and that explosiveness was specifically what Pearl was interested in examining and relating to other variables.
Here you see the beginning of the second of his influenza studies. It is an update of the first one, because after the first study he received criticism from peers who said he had not been accurate with certain variables. He had issues with what we now refer to as construct validity: some of the variables were not really measuring what they were supposed to measure, and they could be defined more appropriately so that they really measured what they should. He did that: he reworked the variables, set up new definitions, and moved forward.
The data for the first of his studies look as follows. We have data from Pearl on 39 American cities. The response variable he wanted to study was the epidemicity index, denoted I5 in his studies. That was the first response variable he wanted to study, and he wanted to find other variables, such as demographics, that could be predictive of it.
This response variable was a measure of the explosiveness of the outbreak in terms of epidemic mortality, that is, how explosively the outbreak hit the various cities in the US. He defined a peak-time ratio: the peak of the excess mortality rates divided by the date of that peak. In this way, he wanted to distinguish cities with one single, very sharp peak from cities with a long, flattened curve of excess mortality rates. He devised this epidemicity index himself.
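To make the peak-time-ratio idea concrete, here is a rough illustration only; Pearl's actual I5 involves further refinements that are not spelled out on this slide. With $m_t$ the excess mortality rate in week $t$ of the epidemic, an index of this type behaves like

$$ I \;\approx\; \frac{\max_t m_t}{t^{*}}, \qquad t^{*} = \arg\max_t m_t, $$

so a city with a high, early peak gets a large value, while a city with a low, late peak on a flattened curve gets a small one.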
He wanted to relate this epidemicity index to various factors, such as demographics: first of all, the population density for 1916, and then the geographical position, measured as the straight-line distance from Boston. He also included the age distribution through a chi-square-type age constitution index, showing the deviation of a city's age distribution from a fixed standard age distribution. He also studied the percentage population growth in the decade 1900-1910.
Besides the demographics, he also included the death rates for 1916, so prior to the pandemic: first an aggregate death rate from all causes, and then death rates for specific diseases, namely pulmonary tuberculosis, organic heart disease, acute nephritis and Bright's disease (failure of the kidneys), influenza, pneumonia, typhoid fever, cancer, and measles. There was quite some correlation between these death rates.
I already told you a little bit about the response variable that Pearl developed, the epidemicity index. It did not arise from a single attempt: he started from a first epidemicity index, I1, which he further refined into I2, I3, and I4, and he was finally satisfied enough to work with the I5 epidemicity index, which distinguishes cities with single, very sharp peaks from those with long, low, flat curves of epidemic mortality.
He updated the variables from his first study for the second study, and in that sense he also modified the epidemicity index; he referred to the new one as I6. As another variable of interest, he defined a destructiveness variable based on excess mortality rates. He then wanted to relate these two responses in his second study to the normal death rates, which were the mean death rates over the three years 1915 to 1917. He also brought in the demographics again: he modified the age constitution index based on the 1910 census data and used that.
He also included the sex ratio of the population, males versus females. The population density for 1910 was used; population density already appeared in the first of his studies and remained in the second. Instead of the geographical position, he used latitude and longitude in his second study. Lastly, he again used the percentage population growth in the decade 1900-1910.
These were Pearl's data, to which we added some additional census data from 1910, given here. Instead of the age constitution index, we used the shares of the different age groups, the plain numbers actually, because we were not really happy with the way Pearl defined the age constitution index. It was quite complex and not clear enough, so we decided to simply use the 1910 age shares instead of the age constitution index Pearl developed.
Besides that, we looked into the number of persons per dwelling, the percentage of homes owned, the school attendance of the population 6 to 20 years of age, and the illiteracy in the population 10 years of age and over, to see whether one or more of these factors could be predictive of the three response variables in Pearl's studies.
What was Pearl's analysis all about? He used multiple correlation. He studied all the data using partial correlation coefficients as well as ordinary correlation coefficients of zero order, and he did that very rigorously. He computed everything by hand, and in the partial correlations he took various other factors into account by holding those other variables constant. For the partial correlations and the other correlations, he also computed probable errors to find out whether the correlation coefficients were significant or not, so he did this quite well.
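To recall the machinery Pearl was working with by hand, the classical first-order partial correlation between variables 1 and 2, holding variable 3 constant, and the probable error attached to a correlation coefficient $r$ based on $n$ observations, are

$$ r_{12\cdot 3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^{2})(1 - r_{23}^{2})}}, \qquad \mathrm{P.E.}(r) \approx 0.6745\,\frac{1 - r^{2}}{\sqrt{n}}, $$

and a correlation was conventionally taken to be significant when it exceeded its probable error several times over.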
Now, the analysis that we use selects variables with a null factor and bootstraps the data. We do this because the p-values for our data, which are unfortunately not orthogonal, can become heavily biased. Since we are not working with nicely orthogonal data, the p-values can be biased towards zero. With p-values for observational data, there is always the danger that unimportant factors suddenly appear important and that the Type I error rate is not under control.
To address the fact that the p-values are no longer uniformly distributed for an unimportant variable, we include a random variable, a null factor, an independent normal variable, in the analysis and see to what extent it appears in our variable selection procedure. This idea originated from the 2007 JASA paper by Wu, Boos, and Stefanski.
Specifically, what we have done is include a single null factor in the variable selection procedure and perform 2,500 bootstrap replicates of the variable selection using JMP. We then calculated the proportion of times each variable enters the model. Variables that enter as often as or less often than the null factor are ignorable, and variables that enter more often than the null factor are the ones we select as being predictive of the response variable.
In JMP, we specified two new columns, and actually only one formula is needed: the formula for the bootstrap frequency column that you see here. The bootstrap frequency column is based on our null factor, an independent normal variable that is reinitialized each time during the bootstrap simulation. Based on this reinitialization, the frequency column gets updated so that we obtain a 100% resampling of our data, with the sample size kept fixed.
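As a rough, hedged illustration of what these two columns amount to, the sketch below mimics one bootstrap replicate in Python rather than with the JMP column formulas themselves; the function and column names are ours.

```python
import numpy as np

rng = np.random.default_rng(2023)

def new_bootstrap_columns(n, rng):
    """One bootstrap replicate: a frequency column plus a fresh null factor.

    The frequency column counts how often each of the n original rows is drawn
    when resampling n rows with replacement (the frequencies sum to n), and the
    null factor is an independent standard normal draw per row.
    """
    freq = rng.multinomial(n, np.full(n, 1.0 / n))  # 100% resampling, fixed sample size
    null_factor = rng.standard_normal(n)            # reinitialized for every replicate
    return freq, null_factor

freq, null_factor = new_bootstrap_columns(39, rng)  # 39 cities in Pearl's first study
print(freq.sum())                                   # always 39
```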
Then, for the variable selection, we also had to decide which kind of regression to apply, and we could find that out in the generalized linear model platform of JMP, taking the frequency column into account. If we do variable selection, we also need to do the distribution determination at the same time; it has to happen simultaneously, you cannot do it separately on the original data. Actually, the graph on the left-hand side is a little misleading, because it is based on the original data, whereas the distribution determination should really be done while doing the variable selection itself.
Assuming a Poisson distribution, that assumption was not rejected, so we could move forward with it. A Poisson assumption is also commonly maintained in the literature for analyzing mortality rates. Since it was not rejected, it was reasonable to use Poisson regression for this analysis.
We also applied Poisson regression in combination with variable selection for the epidemicity index I6. However, we switched to normal regression, that is, least squares regression, for the third response variable, the destructiveness variable.
The way to apply this regression with variable selection in JMP is by using the Generalized Regression platform, where we define the response variable for the analysis and put the frequency column into the Freq box. Then, as model terms in the Construct Model Effects window, we include all of Pearl's variables together with our null factor, the independent normal variable.
Then we run the regression procedure with the forward selection estimation method. As a criterion, we used the Akaike information criterion with the correction for small sample sizes (AICc) to decide on the final model and thus the selected variables.
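As a minimal sketch of what forward selection on AICc with a Poisson response and a frequency column amounts to, here is our own illustration in Python with statsmodels; it is not JMP's Generalized Regression implementation, the function names are hypothetical, and the centering and scaling of the predictors mentioned on the next slide is omitted for brevity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def aicc(fit, n):
    """Small-sample corrected Akaike information criterion."""
    k = fit.df_model + 1  # number of estimated coefficients, incl. intercept
    return fit.aic + 2 * k * (k + 1) / (n - k - 1)

def forward_select_poisson(y, X, freq):
    """Greedy forward selection of columns of X by AICc for a Poisson GLM.

    freq is the bootstrap frequency column, passed as frequency weights.
    Returns the list of selected column names.
    """
    n = int(np.sum(freq))

    def fit_cols(cols):
        design = (sm.add_constant(X[cols]) if cols
                  else pd.DataFrame({"const": np.ones(len(y))}, index=X.index))
        return sm.GLM(y, design, family=sm.families.Poisson(),
                      freq_weights=freq).fit()

    selected, remaining = [], list(X.columns)
    best = aicc(fit_cols(selected), n)
    improved = True
    while improved and remaining:
        improved = False
        scores = {c: aicc(fit_cols(selected + [c]), n) for c in remaining}
        candidate = min(scores, key=scores.get)
        if scores[candidate] < best:  # keep adding terms while AICc improves
            best = scores[candidate]
            selected.append(candidate)
            remaining.remove(candidate)
            improved = True
    return selected
```

Note that JMP evaluates AICc along the whole forward path and picks the best model on that path; the greedy stopping rule above is a simplification of that idea.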
The solution path that you see here is based on normalized data, centered and scaled to put the variables on the same scale and to diminish the effect of the considerable multicollinearity in the data. You then see the original predictors appearing again in the lower output. We select the model with the lowest AICc, the variables having been selected one after the other by forward selection based on the AICc. As you can see in this output, the null factor got into the model. The null factor is completely unimportant and uninformative, yet it got into the model, which is of course not good.
We had to run this estimation method, this variable selection procedure, 2,500 times. We did this in JMP by right-clicking on the estimate column, hitting Simulate, and selecting the frequency column as the column to switch in and out between the different bootstrap replicates, so that the null factor was reinitialized for every bootstrap replicate. In the end, we obtained the 2,500 model selections from the bootstrap simulation.
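Conceptually, the Simulate step corresponds to the loop sketched below, which reuses the two hypothetical helpers from the earlier sketches (new_bootstrap_columns and forward_select_poisson) and records how often each variable is selected.

```python
from collections import Counter
import numpy as np

def bootstrap_selection(y, X, n_boot=2500, seed=2023):
    """Repeat the forward selection on n_boot bootstrap replicates and return
    the proportion of replicates in which each variable (including the null
    factor) entered the model."""
    rng = np.random.default_rng(seed)
    cols = list(X.columns) + ["null_factor"]
    counts = Counter()
    for _ in range(n_boot):
        freq, null = new_bootstrap_columns(len(y), rng)  # fresh resample + null factor
        Xb = X.copy()
        Xb["null_factor"] = null
        counts.update(forward_select_poisson(y, Xb, freq))
    return {var: counts[var] / n_boot for var in cols}
```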
We then computed, for each variable, the proportion of replicates in which it entered the model, that is, in which it received a non-zero estimate. We were of course especially interested in how often the null factor appeared in the selected models. This turned out to be 41% of the time, which is quite high. That is effectively our false entry rate, and it is a very high percentage for the null factor to be selected.
On top of that, we used an upper bound, an upper 99.9% confidence limit on the null factor's entry proportion, and sometimes went even a little higher, to be assured that the factors you see in green really appeared more often in the model selections. As you will see in another example in a moment, some variables were essentially tied with the null factor. The variables in red were not selected, since they occur less often than the null factor over the bootstrap simulations. The variables that got selected here and are predictive of the epidemicity index I5 are the death rate from all causes, the death rate from organic heart disease, the death rate from pneumonia, the death rate from cancer, the death rate from measles, and the geographical position.
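The decision rule behind the green and red coloring can be sketched as follows; this is again a hedged Python illustration with our own names, where the 99.9% limit is approximated by a simple normal-approximation upper confidence bound on the null factor's entry proportion.

```python
import numpy as np
from scipy.stats import norm

def selected_variables(entry_props, n_boot=2500, conf=0.999):
    """Keep the variables whose bootstrap entry proportion exceeds an upper
    confidence limit on the null factor's entry proportion (the false entry rate)."""
    p0 = entry_props["null_factor"]  # e.g. about 0.41 in our study
    upper = p0 + norm.ppf(conf) * np.sqrt(p0 * (1 - p0) / n_boot)
    return [v for v, p in entry_props.items()
            if v != "null_factor" and p > upper]
```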
Now, the death rate from all causes is an aggregate of the death rates from the individual specific diseases. We were also interested to see what would happen if we took it out of the analysis, which is what we did on the following slides. When we took out the death rate from all causes, we got a different picture. The death rate from measles and the death rate from cancer now turned out to be unimportant, whereas the death rate from pneumonia was in the green zone, as was the death rate from pulmonary tuberculosis. These are quite highly correlated with the death rate from all causes, which had masked them, although some of the death rates are also correlated among each other, so we have to be careful here. That was a new result that we saw.
Eventually, we repeated these variable selections with only the variables in green that you see here on the screen, the ones that got selected, to which we added our new 1910 census data, such as the age shares, the illiteracy, the schooling, and so forth. We did this with and without the death rate from all causes, to finally select the variables that were common across all analyses. The different analyses still each selected some other variables, but there were variables that were present all the time, and these are the ones we finally retained in the model, which is the final Poisson regression model we obtained in the end.
It contains the death rate from organic heart disease; the death rate from all causes, which we kept, as Pearl also stressed it as being quite important; the share of ages 0 to 4; the school attendance of the population 6 to 20 years of age; and the geographical position. That was our final regression model for the first response, the epidemicity index I5, and below you see Pearl's results. From his correlation analysis, he identified pulmonary tuberculosis, organic heart disease, and acute nephritis and Bright's disease, besides the death rate from all causes, which he deemed quite important. That was also always the case in our analysis; it always came out on top, so we kept it in.
Pearl also pointed towards some specific chronic diseases, but we were not able to put these on top; we did not find them to be stably present across our analyses, so we did not identify them. Pearl actually got criticized after his first study for pointing towards these individual diseases. In the second of his studies he was more prudent, more conservative, and only pointed towards the death rate from all causes and the organic diseases of the heart as being predictive of his modified index of the explosiveness of the outbreak, I6.
Our final analysis, which you see here, also points towards the death rate from organic heart disease, the death rate from all causes, the share of ages between 0 and 4, and the population density. We always saw the population density coming up high in the list of variables, occurring almost continuously over the bootstrap replicates, so it also turned out to be an important variable in our analysis.
With the destructiveness variable, we did not find all that much variation; the range of values was not as large as for the other two responses, the epidemicity indices. We were only able to identify a few variables, specifically the death rate from organic heart disease. Besides the share of the youngest people, between 0 and 4 years, we also identified the share of people between 25 and 44 years of age, the people who were actually healthy prior to the pandemic. That, too, was indicative of the excess mortality rates over the different cities in the United States.
As we saw, Pearl's data sets contain very little data: only 39 observations in the first study and only 34 in the second, because when some of the variables were modified, some observations were lost. The data are also observational, with quite some multicollinearity involved, and the quality of the data could have been better. Our analyses are therefore certainly useful, but they are not magical, and the same holds for Pearl's correlation analyses. His first analysis in particular was not well supported; he knew afterwards that people did not really accept it. Anyway, as George Box said, "All models are wrong, but some are useful."
We were able to select satisfactory models in a sequential manner. First, we included Pearl's variables and retained the selected ones; then we added the new 1910 census variables and selected the informative variables again, each time with and without the death rate from all causes. Finally, we retained the variables that appeared in the green zone all the time, over all the analyses we did, as being quite predictive of the response, and those gave the models I presented. I hope this was informative for you, and many thanks.