
Revisiting Pearl’s Influenza Studies Using JMP® (2023-EU-30MP-1319)

Roselinde Kessels, Assistant Professor, University of Antwerp and Maastricht University
Chris Gotwalt, JMP Director of Statistical Research and Development, JMP
Guido Erreygers, Professor of Economics, University of Antwerp

 

In 1919 and 1921, Raymond Pearl published four empirical studies on the Spanish flu epidemic. He explored the factors that might explain the epidemic’s explosiveness and destructiveness in America’s largest cities. Using partial correlation coefficients, he tried to isolate the net effects of the possible explanatory factors, such as general demographic characteristics of the cities and death rates for various diseases, on the variables measuring the severity of the epidemic. In this presentation, we revisit Pearl’s data and apply variable selection with a pseudo-variable in JMP®'s Generalized Regression platform instead of Pearl’s correlation analysis. We use Poisson forward selection on the variables in the platform with AICc validation. We find that our results largely correspond with Pearl’s conclusions but contain some additional nuances that are substantive. This paper contributes to the literature showing that Poisson regression proves useful for historical epidemiology. JMP’s Generalized Regression platform offers researchers much flexibility in choosing the most appropriate model.

 

 

Okay, welcome everybody to this presentation, in which I will talk about Pearl's influenza studies. Raymond Pearl was a biostatistician who lived quite a while ago, at the time of the Spanish flu, and he was the first one to analyze the weekly data collected by the US Bureau of the Census about the Spanish flu, which occurred from 1918 to 1920, for the large American cities.

Pearl was the first to analyze the data regarding the Spanish flu, and he wrote two reports about it, his influenza studies one and two, which we will revisit during this presentation. We will see how we can look into the data and analyze it using JMP. This is joint work with Chris Gotwalt, who contributed to the methodological component, and Guido Erreygers, who initiated the idea of revisiting Pearl's influenza studies.

The overview of this talk is as follows. First, I'll discuss the Spanish flu a little bit, what it was all about; quite a deadly pandemic at that time. Then I'll introduce Pearl's influenza studies and talk a little about the data that Pearl used in his analysis. We have added some census data from 1910 on top of that for our own analysis, which consists of a variable selection procedure with a null factor, that is, a random, independent normal factor, together with bootstrap simulation from the data. Then we'll discuss our results, compare them with Pearl's results, and conclude.

First of all, the Spanish flu pandemic. To frame it a little bit: it was one of the deadliest pandemics in history, as witnessed by this list showing the world's deadliest pandemics over time. You can see that the Spanish flu ranks fifth in terms of the number of deaths it caused. The list is headed by the Black Death, which killed about four times more people than the Spanish flu did. Further down the list, you see the COVID-19 pandemic appearing, which we all got exposed to.

Just as with COVID-19, gatherings were encouraged to happen outside, and that was already the case at the time of the Spanish flu pandemic as well, even for classes to be held outdoors, as you can see in this photo.

The Spanish flu pandemic consisted of three waves. The first one started in March 1918 in the US and spread to Europe and to the rest of the world. Then a more severe wave, the severest one, started in August 1918 in France and spread rapidly to the rest of the world; it coincided with the end of the First World War. Then a third one, less severe than the second but more severe than the initial wave, started in early 1919 and hit some specific countries like Australia and Great Britain.

Here you see the timeline of the three waves of the Spanish flu pandemic in the US specifically, because it is Pearl's US data that we look into. The death toll was humongous, and most deaths actually occurred in India and China.

What was specific about the Spanish flu pandemic was that many young, healthy adults died, as shown by the W pattern of mortality rates, given here by the solid black line, in contrast to the U shape of the normal seasonal influenza and pneumonia mortality rates registered prior to the pandemic.

For seasonal influenza and pneumonia, this U shape shows that individuals younger than four years of age and people older than 75 years were hit hardest. Characteristic of the Spanish flu pandemic was that, besides those two age groups, the young adults between 20 and 40 years of age also got specifically hit, eventually leading to a huge death toll.

Further onwards throughout history, epidemiologists and historians of epidemiology have applied statistical analysis to the Spanish flu pandemic. More specifically for the US, Markel and colleagues studied the effect of non-pharmaceutical interventions in American cities, like school closures and cancellations of public gatherings.

Mamelund studied the influence of geographical isolation of specific areas and regions in the US. Clay, Lewis, and Severnini studied cross-city variation across 438 US cities. There has also been work elsewhere around the world, not only in the US but also in Europe, for instance in Norway and Spain, where data about the Spanish flu pandemic were analyzed together with census data and so forth. The Spanish flu pandemic has kept intriguing many researchers over time.

Now, about Pearl's influenza studies: here you see the beginning of the first of them. First, he talks about the severity of the outbreak; the death toll for the United States was set at about 550,000, roughly five times the number of Americans who died during the First World War. That shows the severity of this pandemic, how deadly and how explosive it was. That is specifically what Pearl was interested in examining and relating to other variables.

The beginning of the second of his influenza studies looks as follows. It is an update of the first study, because after the first study he was criticized by peers for not being accurate with certain variables. He had issues with what we now refer to as construct validity: some of the data were not really measuring what they were supposed to measure, and they could be defined more appropriately so that they really did. He did exactly that: he reworked the variables, set up new definitions, and moved forward.

The data for the first of his studies look as follows. We have data from Pearl on 39 American cities. The response variable he wanted to study was the epidemicity index, denoted I5 in his studies. That was the first response variable, and he wanted to find other variables, like demographics, that could be predictive of it.

Now, this response variable was a measure of the explosiveness of the outbreak in terms of epidemic mortality: how explosively the outbreak hit the various cities in the US. He defined it as a peak-time ratio, the peak of the excess mortality rates divided by the date of the peak. In this way, he wanted to compare cities with one single, very sharp peak to cities with a long, flattened curve of excess mortality rates. He devised this epidemicity index himself.
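
Schematically (Pearl's published I5 involves further refinements, so this is only the basic idea behind the peak-time ratio):

\[ I \approx \frac{\text{peak weekly excess mortality rate}}{\text{weeks from onset of the epidemic to the peak}}, \]

so a single sharp, early peak yields a large index, while a long, flat curve of excess mortality yields a small one.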

He wanted to relate this epidemicity index to various factors, like the demographics: first of all, the population density for 1916, and then the geographical position, the straight-line distance from Boston. He also included the age distribution chi-square, an age constitution index showing the deviation of a city's age distribution from a fixed standard age distribution. He also studied the percentage population growth in the decade 1900-1910.

Besides the demographics, he also involved the death rates for 1916, so prior to the pandemic: first of all, an aggregate death rate from all causes, and then the death rates for specific diseases, namely pulmonary tuberculosis, organic heart disease, acute nephritis and Bright's disease (failure of the kidneys), influenza, pneumonia, typhoid fever, cancer, and measles. There was quite some correlation between these death rates.

I already told you a little bit about the response variable that Pearl developed, the epidemicity index. It did not arise in just a single attempt: he started from a first epidemicity index, I1, which he further improved into I2, I3, and I4, until he was happy enough to move along and work with the I5 epidemicity index, to distinguish cities with single, very sharp peaks from those with long, low, flat curves of epidemic mortality.

He updated the variables from his first study in the second study, and in that sense he also modified the epidemicity index; he referred to the new one as I6. As another variable of interest, he also defined a destructiveness variable based on excess mortality rates. He then wanted to relate these two responses in his second study to the normal death rates, the mean death rates over the three years 1915 to 1917. He also brought in the demographics again: he modified the age constitution index based on the 1910 census data and used that.

He also involved the sex ratio of the population, males versus females. The population density of 1910, used in the first of his studies, remained in the second study. Instead of the geographical position, he used latitude and longitude in his second study. Lastly, he again used the percentage population growth in the decade 1900-1910.

These were Pearl's data, to which we added some additional census data from 1910, given over here. Instead of the age constitution index, we used the shares of the different age groups, the pure numbers actually, because we were not really happy with the way Pearl defined the age constitution index. It was quite complex and not clear enough, so we decided to simply use the 1910 age shares instead of the age constitution index developed by Pearl.

Besides that, we looked into the number of persons to a dwelling, the percentage of homes owned, the school attendance of the population 6 to 20 years of age, and the illiteracy in the population 10 years of age and over, to see whether one or more of these factors could be predictive of the three response variables in Pearl's studies.

What was Pearl's analysis all about? Well, he got into multiple correlation. He studied all the data making use of partial correlation coefficients as well as ordinary zero-order correlation coefficients, and he did that very rigorously. He computed all of this by hand, and did it quite well, accounting for the other factors in the partial correlations by holding those other variables constant. For the partial correlations and the other correlations, he also computed probable errors to find out whether the correlation coefficients were significant or not.

Now, the analysis that we are using is one in which we select variables with a null factor and bootstrap our data. We do so because the p-values for our data, which are unfortunately not orthogonal, can become heavily biased. Since we are not using nicely orthogonal data, the p-values can become quite biased towards zero. That is always the danger with p-values for observational data: unimportant factors all of a sudden become important, and the Type I error rate is not under control.

To address the fact that the p-values are no longer uniformly distributed in the case of an unimportant variable, we involve a random variable, a null factor, an independent normal variable, in the analysis, and see to what extent it appears in our variable selection procedure. This idea originated from the JASA paper by Wu, Boos, and Stefanski in 2007.

Specifically, what we have done is include a single null factor in the variable selection procedure and perform 2,500 bootstrap replicates of the variable selection using JMP. Then we calculated the proportion of times each variable enters the model. Variables that enter as often as or less often than the null factor are ignorable, and variables that enter more often than the null factor are the ones we select as being predictive of the response variable.
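
As a minimal sketch of this procedure in Python (an assumption on my part: the talk runs everything through JMP's Generalized Regression platform, and the toy data and column names below are illustrative, not Pearl's variables):

```python
# Null-factor bootstrap variable selection, after Wu, Boos & Stefanski (2007).
# Toy data stand in for Pearl's 39 cities; names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)

def aicc(fit, k, n):
    """AIC with the small-sample correction (AICc)."""
    return fit.aic + (2 * k * (k + 1)) / (n - k - 1)

def forward_select_poisson(y, X, freq):
    """Forward selection on a Poisson regression, scored by AICc."""
    n = len(y)
    base = sm.GLM(y, np.ones((n, 1)), family=sm.families.Poisson(),
                  freq_weights=freq).fit()
    best, selected = aicc(base, 1, n), []
    remaining = list(X.columns)
    while remaining:
        scores = []
        for v in remaining:
            cols = selected + [v]
            fit = sm.GLM(y, sm.add_constant(X[cols]),
                         family=sm.families.Poisson(),
                         freq_weights=freq).fit()
            scores.append((aicc(fit, len(cols) + 1, n), v))
        score, v = min(scores)
        if score >= best:                  # no AICc improvement: stop
            break
        best, selected = score, selected + [v]
        remaining.remove(v)
    return selected

n, B = 39, 2500                            # 39 cities, 2,500 replicates (slow; lower B for a quick test)
X = pd.DataFrame(rng.normal(size=(n, 4)),
                 columns=["density", "distance", "heart", "pneumonia"])
y = rng.poisson(np.exp(2.0 + 0.4 * X["heart"]))

counts = dict.fromkeys(list(X.columns) + ["null_factor"], 0)
for _ in range(B):
    Xb = X.copy()
    Xb["null_factor"] = rng.normal(size=n)     # reinitialized every replicate
    freq = rng.multinomial(n, np.ones(n) / n)  # 100% resampling, fixed n
    for v in forward_select_poisson(y, Xb, freq):
        counts[v] += 1

# Keep only variables that enter more often than the null factor.
cutoff = counts["null_factor"]
keep = [v for v, c in counts.items() if v != "null_factor" and c > cutoff]
print({v: c / B for v, c in counts.items()}, keep)
```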

In JMP, we specified two new columns, and actually only one formula is needed: the formula for the bootstrap frequency column that you see here. The bootstrap frequency column is tied to our null factor, an independent normal variable that is reinitialized each time during the bootstrap simulation. With each reinitialization, the frequency column gets updated so that we have a 100% resampling of our data at its fixed sample size.
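
For intuition, here is a tiny numpy illustration of what such a frequency column encodes (the exact JMP column formula is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 39                                   # Pearl's first study covers 39 cities
# Frequency column: how many times each city enters this bootstrap sample.
freq = rng.multinomial(n, np.ones(n) / n)
assert freq.sum() == n                   # 100% resampling at the fixed sample size
# The same draw, viewed as resampling row indices with replacement:
idx = np.repeat(np.arange(n), freq)
```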

Then, for the variable selection, we had to decide what kind of regression to apply, which we could explore in JMP's Generalized Linear Model platform. Because the frequency column has to be taken into account, the distribution determination needs to happen simultaneously with the variable selection; you cannot do it separately on the original data. Actually, the graph on the left-hand side is a little misleading, because it is based on the original data, whereas the distribution determination should really be done while doing the variable selection itself.

We assumed a Poisson distribution. That assumption was not rejected, so we could move forward with it; the Poisson distribution is also commonly maintained in the literature for analyzing mortality rates. It was good to assume Poisson regression for the analysis.
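
The talk does not spell out which diagnostic JMP reported, but a rough sketch of one common Poisson-adequacy check, comparing the residual deviance against a chi-square reference, would look like this:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(3)
X = pd.DataFrame({"heart": rng.normal(size=39)})   # toy stand-in for city data
y = rng.poisson(np.exp(2.0 + 0.3 * X["heart"]))

fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
# Under a well-specified Poisson model, the residual deviance is roughly
# chi-square with df_resid degrees of freedom; a large p-value means the
# Poisson assumption is not rejected.
p_value = stats.chi2.sf(fit.deviance, fit.df_resid)
print(f"deviance = {fit.deviance:.1f}, df = {fit.df_resid}, p = {p_value:.3f}")
```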

We also applied Poisson regression in combination with variable selection for the epidemicity index I6. However, we switched to normal regression, that is, least squares regression, for the third response variable, the destructiveness variable.

The way to apply this regression with variable selection in JMP is through the Generalized Regression platform, where we define the response variable for the analysis and put the frequency column into the Freq box. Then, as model terms in the Construct Model Effects window, we enter all of Pearl's variables, together with our null factor, the independent normal variable that we included.

Then we move forward with the regression procedure by selecting the Forward Selection estimation method. As a criterion, we used the Akaike information criterion with the correction for small sample sizes (AICc) to decide upon the final model, and thus on the selection of variables.
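
For reference, the small-sample correction referred to here is

\[ \mathrm{AICc} = -2\log L + 2k + \frac{2k(k+1)}{n-k-1}, \]

where \(L\) is the maximized likelihood, \(k\) the number of estimated parameters, and \(n\) the sample size; with only 39 (or 34) cities, the correction term is not negligible.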

The solution path that you see here is based on normalized data, that is, scaled and centered data, to put the variables on the same scale and also to diminish the effect of multicollinearity, of which there is quite some in the data. Then you see the original predictors popping up in the lower output. We select the model with the lowest AICc, the variables being picked one after the other by forward selection based on that criterion. As you can see in this output, the null factor got into the model: a completely unimportant, uninformative factor got selected, which is not good, of course.

We had to run this estimation method, this variable selection procedure, 2,500 times, and we did this in JMP by right-clicking on the estimates column, hitting Simulate, and selecting the frequency column as the column to switch in and out between the different bootstrap replicates, so that reinitialization of the null factor was guaranteed between replicates. In the end, we got the 2,500 model selections out of the bootstrap simulation.

Then we computed the proportion of times each of these variables got into the model, that is, was given a non-zero estimate. We were especially interested, of course, in how many times the null factor appeared in the selected models. This turns out to be 41% of the time, which is quite high; it is effectively our false entry rate. A very high percentage at which the null factor got selected.

We accounted for an upper bound, an upper 99.9% confidence limit on this proportion, and sometimes even a little higher, to be assured that the factors you see in green appeared more often in the model selections than the null factor. As you will see in another example shortly, some variables tied with the null factor as well. The factors or variables in red have not been selected, since their occurrence over the different bootstrap replicates is lower than that of the null factor. The variables that got selected here and are predictive of the epidemicity index I5 are the death rate from all causes, the death rate from organic heart disease, the death rate from pneumonia, the death rate from cancer, the death rate from measles, and the geographical position.
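
As a hedged sketch of such a bound (the talk mentions an upper 99.9% confidence limit but not its exact construction, so a normal approximation is assumed here):

```python
import math
from scipy import stats

p_hat, B = 0.41, 2500        # null factor entered 41% of the 2,500 replicates
z = stats.norm.ppf(0.999)    # one-sided 99.9% quantile
upper = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / B)
print(f"upper 99.9% limit on the null-factor entry rate: {upper:.3f}")  # ~0.44
```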

Now, the death rate from all causes is an aggregate of the death rates from the individual specific chronic diseases. We were also interested to see what would happen if we took it out of the analysis. This is what we did on the following slide. We took out the death rate from all causes, and then we got a different picture. The death rate from measles and the death rate from cancer turned out to be unimportant now, whereas the death rate from pneumonia got into the green zone, as did the death rate from pulmonary tuberculosis. These were quite highly correlated with the death rate from all causes, which had masked them, although some of the death rates are also correlated among each other, so we have to be careful here. That was a new result that we saw.

Eventually, we repeated these variable selections with only the variables in green that you see here on the screen, the ones that got selected, to which we added our new 1910 census data, like the age shares, illiteracy, schooling, and so forth. We did this with and without the death rate from all causes, to finally select the variables that were common across all analyses. The different analyses each selected some other variables still, but some variables were present all the time, and these are the ones we finally retained in the model, our final Poisson regression model.

It contains the death rate from organic heart disease; the death rate from all causes, which we kept in, as Pearl also stressed its importance; the share of ages 0 to 4; the school attendance of the population 6 to 20 years of age; and the geographical position. That was our final regression model for the first response, the epidemicity index I5, and below you see Pearl's results. Having done a correlation analysis, he identified pulmonary tuberculosis, organic heart disease, and acute nephritis and Bright's disease, besides the death rate from all causes, which he deemed quite important. That was also always the case in our analysis: it always came out on top, so we kept it in.

Now, Pearl also pointed towards some specific chronic diseases, but we were not able to confirm these. We did not find them to be stably present across our analyses, so we did not identify them. Pearl, too, got criticized after his first study for pointing towards these individual diseases. In the second of his studies he was more prudent, more conservative, and only pointed towards the death rate from all causes and the organic diseases of the heart as the ones predictive of his modified index of the explosiveness of the outbreak, I6.

Our final analysis for I6 is the one that you see here, also pointing towards the death rate from organic heart disease, the death rate from all causes, the share of ages zero to four, and the population density. We were always able to see the population density coming up high in the list of variables that occurred almost continuously over the bootstrap replicates, so that also turned out to be an important variable in our analysis.

With the destructiveness variable, we did not find all that much variation; the range of values was not as large as for the other two responses, the epidemicity indices. We were only able to identify a few variables, specifically the death rate from organic heart disease. Besides the share of the youngest people, between 0 and 4 years, we also identified the share of people between 25 and 44 years of age, the healthy adults prior to the pandemic. That share was also indicative of excess mortality rates across the different cities in the United States.

Pearl's data sets, as we could see, were very small: only 39 observations in the first study, and in the second study only 34, because when some of the variables were modified, some observations got lost. The data are also observational, with quite some multicollinearity involved, and the quality of the data could have been better. Therefore, our analyses are certainly useful, but they are not magical, and neither was Pearl's correlation analysis. His first analysis in particular was not supported; he knew afterwards that his peers did not really back it. Anyway, as George Box said, "All models are wrong, but some are useful."

We were able to select satisfactory models in a sequential manner. First, we included Pearl's variables and retained the selected ones, to which we then added the new 1910 census variables, to finally select those variables that are informative, each time with and without the death rate from all causes. We retained the variables that popped up in the green zone, quite predictive of the response, across all the analyses we did, to arrive at the models I presented. I hope this was informative for you to listen to, and many thanks.